Posted to user@nutch.apache.org by Anchit Jain <an...@gmail.com> on 2015/04/06 22:13:28 UTC
Nutch 1.9 integration with Solr 5.0.0
I want to index nutch results using *Solr 5.0* but as mentioned in
https://wiki.apache.org/nutch/NutchTutorial there is no directory
${APACHE_SOLR_HOME}/example/solr/collection1/conf/
in Solr 5.0. So where do I have to copy *schema.xml*?
Also there is no *start.jar* present in the example directory.
Re: Nutch 1.9 integration with Solr 5.0.0
Posted by Anchit Jain <an...@gmail.com>.
Currently I am not using any db. The data from nutch is stored in a directory.
Do you know the correct URL for posting data to a core in Solr?
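[Editor's note: on the URL question above, the "#/core" piece shown in the Solr 5 admin UI is a browser-side fragment and is never sent to the server, so it must not appear in the URL handed to Nutch or to any HTTP client. A small sketch of deriving a core's update endpoint for manual testing (the core name "foo" is just an example):]

```python
from urllib.parse import urlsplit, urlunsplit

def core_update_url(admin_url, core):
    # urlsplit() puts the browser-only '#/...' part into the fragment,
    # which we simply discard; the server-side path is scheme://host/solr.
    s = urlsplit(admin_url)
    path = s.path.rstrip("/")  # e.g. '/solr'
    return urlunsplit((s.scheme, s.netloc, "%s/%s/update" % (path, core), "", ""))

# The admin UI shows http://localhost:8983/solr/#/foo, but the endpoint
# that accepts POSTed documents is /solr/foo/update:
print(core_update_url("http://localhost:8983/solr/#/foo", "foo"))
# -> http://localhost:8983/solr/foo/update
```

[For the `solrindex` command itself, the argument would be the core URL without `/update` (e.g. http://localhost:8983/solr/foo), since Nutch's SolrJ client appends `/update` on its own, as the `request: .../solr/update` line in the logs below shows.]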
On Wednesday 08 April 2015 12:22 AM, yeshwanth kumar wrote:
> Which db are you using?
> If you have any time constraints for your task, write a MapReduce job which
> reads from the DB and then indexes into Solr.
>
> I use the crawl script for fetching, parsing, storing and indexing all at
> once instead of doing them individually with the nutch script.
>
> The error clearly points at the URL, so do a curl or write a simple Java
> program using HTTP GET/POST to read and write documents. This will give you
> some understanding of whether the URL is correct or not.
>
> Sent from mobile, please excuse any typographical errors.
> On Apr 7, 2015 12:00 PM, "Anchit Jain" <an...@gmail.com> wrote:
>
>> Same error :-( .
>>
>> So no workaround for the error?
>>
>> On Tuesday 07 April 2015 10:06 PM, Jeff Cocking wrote:
>>
>>> I use the following for all my indexing work.
>>>
>>> Usage: bin/crawl <seedDir> <crawlDir> <solrURL> <numberOfRounds>
>>> Example: bin/crawl urls/ TestCrawl/ http://localhost:8983/solr/ 2
>>>
>>>
>>> On Tue, Apr 7, 2015 at 11:20 AM, Anchit Jain <an...@gmail.com>
>>> wrote:
>>>
>>>> Yes, it is working correctly from the browser. I can also manually add
>>>> documents from the web browser, but not through nutch.
>>>> I am not able to figure out where the problem is.
>>>>
>>>> Is there any manual way of adding the crawldb and linkdb to nutch besides
>>>> that command?
>>>>
>>>>
>>>> On Tuesday 07 April 2015 09:47 PM, Jeff Cocking wrote:
>>>>
>>>>> There can be numerous reasons... hosts.conf, firewall, etc. These are
>>>>> all unique to your system.
>>>>>
>>>>> Have you viewed the solr admin panel via a browser? This is a critical
>>>>> step in the installation. This validates SOLR can accept HTTP commands.
>>>>>
>>>>> On Tue, Apr 7, 2015 at 9:53 AM, Anchit Jain <an...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> I created a new core named *foo*. Then I copied the *schema.xml* from
>>>>>> *nutch* into *var/solr/data/foo/conf* with changes as described in
>>>>>> *https://wiki.apache.org/nutch/NutchTutorial*.
>>>>>> I changed the url to *http://localhost:8983/solr/#/foo*,
>>>>>> so the new command is
>>>>>> "*bin/nutch solrindex http://localhost:8983/solr/#/foo crawl/crawldb/
>>>>>> -linkdb crawl/linkdb/ crawl/segments/20150406231502/ -filter
>>>>>> -normalize*"
>>>>>> But now I am getting the error
>>>>>> *org.apache.solr.common.SolrException: HTTP method POST is not
>>>>>> supported
>>>>>> by this URL*
>>>>>>
>>>>>> Is some other change also required in the URL to support POST requests?
>>>>>>
>>>>>> Full log
>>>>>>
>>>>>> 2015-04-07 20:10:56,068 INFO indexer.IndexingJob - Indexer: starting at 2015-04-07 20:10:56
>>>>>> 2015-04-07 20:10:56,178 INFO indexer.IndexingJob - Indexer: deleting gone documents: false
>>>>>> 2015-04-07 20:10:56,178 INFO indexer.IndexingJob - Indexer: URL filtering: true
>>>>>> 2015-04-07 20:10:56,178 INFO indexer.IndexingJob - Indexer: URL normalizing: true
>>>>>> 2015-04-07 20:10:56,727 INFO indexer.IndexWriters - Adding org.apache.nutch.indexwriter.solr.SolrIndexWriter
>>>>>> 2015-04-07 20:10:56,727 INFO indexer.IndexingJob - Active IndexWriters :
>>>>>> SOLRIndexWriter
>>>>>> solr.server.url : URL of the SOLR instance (mandatory)
>>>>>> solr.commit.size : buffer size when sending to SOLR (default 1000)
>>>>>> solr.mapping.file : name of the mapping file for fields (default solrindex-mapping.xml)
>>>>>> solr.auth : use authentication (default false)
>>>>>> solr.auth.username : use authentication (default false)
>>>>>> solr.auth : username for authentication
>>>>>> solr.auth.password : password for authentication
>>>>>>
>>>>>> 2015-04-07 20:10:56,772 INFO indexer.IndexerMapReduce - IndexerMapReduce: crawldb: crawl/crawldb
>>>>>> 2015-04-07 20:10:56,772 INFO indexer.IndexerMapReduce - IndexerMapReduce: linkdb: crawl/linkdb
>>>>>> 2015-04-07 20:10:56,772 INFO indexer.IndexerMapReduce - IndexerMapReduces: adding segment: crawl/segments/20150406231502
>>>>>> 2015-04-07 20:10:57,205 WARN util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
>>>>>> 2015-04-07 20:10:58,020 INFO anchor.AnchorIndexingFilter - Anchor deduplication is: off
>>>>>> 2015-04-07 20:10:58,134 INFO regex.RegexURLNormalizer - can't find rules for scope 'indexer', using default
>>>>>> 2015-04-07 20:11:00,114 INFO regex.RegexURLNormalizer - can't find rules for scope 'indexer', using default
>>>>>> 2015-04-07 20:11:01,205 INFO regex.RegexURLNormalizer - can't find rules for scope 'indexer', using default
>>>>>> 2015-04-07 20:11:01,344 INFO regex.RegexURLNormalizer - can't find rules for scope 'indexer', using default
>>>>>> 2015-04-07 20:11:01,577 INFO regex.RegexURLNormalizer - can't find rules for scope 'indexer', using default
>>>>>> 2015-04-07 20:11:01,788 INFO regex.RegexURLNormalizer - can't find rules for scope 'indexer', using default
>>>>>> 2015-04-07 20:11:01,921 INFO indexer.IndexWriters - Adding org.apache.nutch.indexwriter.solr.SolrIndexWriter
>>>>>> 2015-04-07 20:11:01,986 INFO solr.SolrMappingReader - source: content dest: content
>>>>>> 2015-04-07 20:11:01,986 INFO solr.SolrMappingReader - source: title dest: title
>>>>>> 2015-04-07 20:11:01,986 INFO solr.SolrMappingReader - source: host dest: host
>>>>>> 2015-04-07 20:11:01,986 INFO solr.SolrMappingReader - source: segment dest: segment
>>>>>> 2015-04-07 20:11:01,987 INFO solr.SolrMappingReader - source: boost dest: boost
>>>>>> 2015-04-07 20:11:01,987 INFO solr.SolrMappingReader - source: digest dest: digest
>>>>>> 2015-04-07 20:11:01,987 INFO solr.SolrMappingReader - source: tstamp dest: tstamp
>>>>>> 2015-04-07 20:11:02,266 INFO solr.SolrIndexWriter - Indexing 250 documents
>>>>>> 2015-04-07 20:11:02,267 INFO solr.SolrIndexWriter - Deleting 0 documents
>>>>>> 2015-04-07 20:11:02,512 INFO solr.SolrIndexWriter - Indexing 250 documents
>>>>>> *2015-04-07 20:11:02,576 WARN mapred.LocalJobRunner - job_local1831338118_0001*
>>>>>> *org.apache.solr.common.SolrException: HTTP method POST is not supported by this URL*
>>>>>>
>>>>>> *HTTP method POST is not supported by this URL*
>>>>>>
>>>>>> request: http://localhost:8983/solr/
>>>>>> at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:430)
>>>>>> at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:244)
>>>>>> at org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:105)
>>>>>> at org.apache.nutch.indexwriter.solr.SolrIndexWriter.write(SolrIndexWriter.java:135)
>>>>>> at org.apache.nutch.indexer.IndexWriters.write(IndexWriters.java:88)
>>>>>> at org.apache.nutch.indexer.IndexerOutputFormat$1.write(IndexerOutputFormat.java:50)
>>>>>> at org.apache.nutch.indexer.IndexerOutputFormat$1.write(IndexerOutputFormat.java:41)
>>>>>> at org.apache.hadoop.mapred.ReduceTask$OldTrackingRecordWriter.write(ReduceTask.java:458)
>>>>>> at org.apache.hadoop.mapred.ReduceTask$3.collect(ReduceTask.java:500)
>>>>>> at org.apache.nutch.indexer.IndexerMapReduce.reduce(IndexerMapReduce.java:323)
>>>>>> at org.apache.nutch.indexer.IndexerMapReduce.reduce(IndexerMapReduce.java:53)
>>>>>> at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:522)
>>>>>> at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:421)
>>>>>> at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:398)
>>>>>> 2015-04-07 20:11:02,724 ERROR indexer.IndexingJob - Indexer: java.io.IOException: Job failed!
>>>>>> at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1357)
>>>>>> at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:114)
>>>>>> at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:176)
>>>>>> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>>>>>> at org.apache.nutch.indexer.IndexingJob.main(IndexingJob.java:186)
>>>>>>
>>>>>>
>>>>>> On Tuesday 07 April 2015 06:53 PM, Jeff Cocking wrote:
>>>>>>
>>>>>>> The command you are using is not pointing to the specific solr index
>>>>>>> you created. The http://localhost:8983/solr needs to be changed to the
>>>>>>> URL for the core created. It should look like
>>>>>>> http://localhost:8983/solr/#/new_core_name.
>>>>>>>
>>>>>>>
>>>>>>> On Tue, Apr 7, 2015 at 2:33 AM, Anchit Jain <anchitjain1234@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> I followed the instructions as given on your blog, created a new core
>>>>>>>> for nutch data, and copied the schema.xml of nutch into it.
>>>>>>>> Then I ran the following command in the nutch working directory:
>>>>>>>> bin/nutch solrindex http://localhost:8983/solr crawl/crawldb/ -linkdb
>>>>>>>> crawl/linkdb/ crawl/segments/20150406231502/ -filter -normalize
>>>>>>>>
>>>>>>>> But the same error comes up as in previous runs.
>>>>>>>>
>>>>>>>> On Tue, 7 Apr 2015 at 10:00 Jeff Cocking <je...@gmail.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Solr5 is multicore by default. You have not finished the install by
>>>>>>>>> setting up solr5's core. I would suggest you look at the link I sent
>>>>>>>>> to finish up your setup.
>>>>>>>>>
>>>>>>>>> After you finish your install your solr URL will be
>>>>>>>>> http://localhost:8983/solr/#/core_name.
>>>>>>>>>
>>>>>>>>> Jeff Cocking
>>>>>>>>>
>>>>>>>>> I apologize for my brevity.
>>>>>>>>> This was sent from my mobile device while I should be focusing on
>>>>>>>>> something else.....
>>>>>>>>> Like a meeting, driving, family, etc.
>>>>>>>>>
>>>>>>>>> On Apr 6, 2015, at 11:16 PM, Anchit Jain <anchitjain1234@gmail.com>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> I have already installed Solr. I want to integrate it with nutch.
>>>>>>>>>> Whenever I try to issue this command to nutch
>>>>>>>>>> "bin/nutch solrindex http://localhost:8983/solr crawl/crawldb/ -linkdb
>>>>>>>>>> crawl/linkdb/ crawl/segments/20150406231502/ -filter -normalize"
>>>>>>>>>> I always get an error:
>>>>>>>>>> Indexer: java.io.IOException: Job failed!
>>>>>>>>>> at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1357)
>>>>>>>>>> at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.
>>>>>>>>>> java:114)
>>>>>>>>>> at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:176)
>>>>>>>>>> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>>>>>>>>>> at org.apache.nutch.indexer.IndexingJob.main(IndexingJob.java:186)
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Here is the complete hadoop log for the process. I have underlined
>>>>>>>>>> the error part in it.
>>>>>>>>>> 2015-04-07 09:38:06,613 INFO indexer.IndexingJob - Indexer: starting at 2015-04-07 09:38:06
>>>>>>>>>> 2015-04-07 09:38:06,684 INFO indexer.IndexingJob - Indexer: deleting gone documents: false
>>>>>>>>>> 2015-04-07 09:38:06,685 INFO indexer.IndexingJob - Indexer: URL filtering: true
>>>>>>>>>> 2015-04-07 09:38:06,685 INFO indexer.IndexingJob - Indexer: URL normalizing: true
>>>>>>>>>> 2015-04-07 09:38:06,893 INFO indexer.IndexWriters - Adding org.apache.nutch.indexwriter.solr.SolrIndexWriter
>>>>>>>>>> 2015-04-07 09:38:06,893 INFO indexer.IndexingJob - Active IndexWriters :
>>>>>>>>>> SOLRIndexWriter
>>>>>>>>>> solr.server.url : URL of the SOLR instance (mandatory)
>>>>>>>>>> solr.commit.size : buffer size when sending to SOLR (default 1000)
>>>>>>>>>> solr.mapping.file : name of the mapping file for fields (default solrindex-mapping.xml)
>>>>>>>>>> solr.auth : use authentication (default false)
>>>>>>>>>> solr.auth.username : use authentication (default false)
>>>>>>>>>> solr.auth : username for authentication
>>>>>>>>>> solr.auth.password : password for authentication
>>>>>>>>>>
>>>>>>>>>> 2015-04-07 09:38:06,898 INFO indexer.IndexerMapReduce - IndexerMapReduce: crawldb: crawl/crawldb
>>>>>>>>>> 2015-04-07 09:38:06,898 INFO indexer.IndexerMapReduce - IndexerMapReduce: linkdb: crawl/linkdb
>>>>>>>>>> 2015-04-07 09:38:06,898 INFO indexer.IndexerMapReduce - IndexerMapReduces: adding segment: crawl/segments/20150406231502
>>>>>>>>>> 2015-04-07 09:38:07,036 WARN util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
>>>>>>>>>> 2015-04-07 09:38:07,540 INFO anchor.AnchorIndexingFilter - Anchor deduplication is: off
>>>>>>>>>> 2015-04-07 09:38:07,565 INFO regex.RegexURLNormalizer - can't find rules for scope 'indexer', using default
>>>>>>>>>> 2015-04-07 09:38:09,552 INFO regex.RegexURLNormalizer - can't find rules for scope 'indexer', using default
>>>>>>>>>> 2015-04-07 09:38:10,642 INFO regex.RegexURLNormalizer - can't find rules for scope 'indexer', using default
>>>>>>>>>> 2015-04-07 09:38:10,734 INFO regex.RegexURLNormalizer - can't find rules for scope 'indexer', using default
>>>>>>>>>> 2015-04-07 09:38:10,895 INFO regex.RegexURLNormalizer - can't find rules for scope 'indexer', using default
>>>>>>>>>> 2015-04-07 09:38:11,088 INFO regex.RegexURLNormalizer - can't find rules for scope 'indexer', using default
>>>>>>>>>> 2015-04-07 09:38:11,219 INFO indexer.IndexWriters - Adding org.apache.nutch.indexwriter.solr.SolrIndexWriter
>>>>>>>>>> 2015-04-07 09:38:11,237 INFO solr.SolrMappingReader - source: content dest: content
>>>>>>>>>> 2015-04-07 09:38:11,237 INFO solr.SolrMappingReader - source: title dest: title
>>>>>>>>>> 2015-04-07 09:38:11,237 INFO solr.SolrMappingReader - source: host dest: host
>>>>>>>>>> 2015-04-07 09:38:11,237 INFO solr.SolrMappingReader - source: segment dest: segment
>>>>>>>>>> 2015-04-07 09:38:11,237 INFO solr.SolrMappingReader - source: boost dest: boost
>>>>>>>>>> 2015-04-07 09:38:11,237 INFO solr.SolrMappingReader - source: digest dest: digest
>>>>>>>>>> 2015-04-07 09:38:11,237 INFO solr.SolrMappingReader - source: tstamp dest: tstamp
>>>>>>>>>> 2015-04-07 09:38:11,526 INFO solr.SolrIndexWriter - Indexing 250 documents
>>>>>>>>>> 2015-04-07 09:38:11,526 INFO solr.SolrIndexWriter - Deleting 0 documents
>>>>>>>>>> 2015-04-07 09:38:11,644 INFO solr.SolrIndexWriter - Indexing 250 documents
>>>>>>>>>> *2015-04-07 09:38:11,699 WARN mapred.LocalJobRunner - job_local1245074757_0001*
>>>>>>>>>> *org.apache.solr.common.SolrException: Not Found*
>>>>>>>>>>
>>>>>>>>>> *Not Found*
>>>>>>>>>>
>>>>>>>>>> *request: http://localhost:8983/solr/update?wt=javabin&version=2*
>>>>>>>>>> * at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:430)*
>>>>>>>>>> * at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:244)*
>>>>>>>>>> * at org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:105)*
>>>>>>>>>> * at org.apache.nutch.indexwriter.solr.SolrIndexWriter.write(SolrIndexWriter.java:135)*
>>>>>>>>>> * at org.apache.nutch.indexer.IndexWriters.write(IndexWriters.java:88)*
>>>>>>>>>> * at org.apache.nutch.indexer.IndexerOutputFormat$1.write(IndexerOutputFormat.java:50)*
>>>>>>>>>> * at org.apache.nutch.indexer.IndexerOutputFormat$1.write(IndexerOutputFormat.java:41)*
>>>>>>>>>> * at org.apache.hadoop.mapred.ReduceTask$OldTrackingRecordWriter.write(ReduceTask.java:458)*
>>>>>>>>>> * at org.apache.hadoop.mapred.ReduceTask$3.collect(ReduceTask.java:500)*
>>>>>>>>>> * at org.apache.nutch.indexer.IndexerMapReduce.reduce(IndexerMapReduce.java:323)*
>>>>>>>>>> * at org.apache.nutch.indexer.IndexerMapReduce.reduce(IndexerMapReduce.java:53)*
>>>>>>>>>> * at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:522)*
>>>>>>>>>> * at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:421)*
>>>>>>>>>> * at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:398)*
>>>>>>>>>> *2015-04-07 09:38:12,408 ERROR indexer.IndexingJob - Indexer: java.io.IOException: Job failed!*
>>>>>>>>>> * at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1357)*
>>>>>>>>>> * at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:114)*
>>>>>>>>>> * at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:176)*
>>>>>>>>>> * at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)*
>>>>>>>>>> * at org.apache.nutch.indexer.IndexingJob.main(IndexingJob.java:186)*
>>>>>>>>>>
>>>>>>>>>> On Tue, 7 Apr 2015 at 03:18 Jeff Cocking <jeff.cocking@gmail.com>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> With Solr5.0.0 you can skip that step. Solr will auto create your
>>>>>>>>>>> schema document based on the data being provided.
>>>>>>>>>>>
>>>>>>>>>>> One of the new features with Solr5 is the install/service feature.
>>>>>>>>>>> I did a quick write up on how to install Solr5 on Centos. Might be
>>>>>>>>>>> something useful there for you.
>>>>>>>>>>> http://www.cocking.com/apache-solr-5-0-install-on-centos-7/
>>>>>>>>>>>
>>>>>>>>>>> jeff
>>>>>>>>>>>
>>>>>>>>>>> On Mon, Apr 6, 2015 at 3:13 PM, Anchit Jain
>>>>>>>>>>> <anchitjain1234@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> I want to index nutch results using *Solr 5.0* but as mentioned in
>>>>>>>>>>>> https://wiki.apache.org/nutch/NutchTutorial there is no directory
>>>>>>>>>>>> ${APACHE_SOLR_HOME}/example/solr/collection1/conf/
>>>>>>>>>>>> in Solr 5.0. So where do I have to copy *schema.xml*?
>>>>>>>>>>>> Also there is no *start.jar* present in the example directory.
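[Editor's note: since adding documents manually from the browser works, as noted above, the same check can be scripted. A minimal sketch of the XML payload that Solr's /update handler accepts; the field names and core name here are only examples, not what Nutch sends (Nutch uses the javabin format):]

```python
import xml.etree.ElementTree as ET

def add_payload(docs):
    """Build an <add><doc>...</doc></add> XML body for Solr's /update handler."""
    add = ET.Element("add")
    for doc in docs:
        d = ET.SubElement(add, "doc")
        for name, value in doc.items():
            field = ET.SubElement(d, "field", attrib={"name": name})
            field.text = str(value)
    return ET.tostring(add, encoding="unicode")

body = add_payload([{"id": "test-1", "title": "hello"}])
print(body)
# One would POST this with Content-Type: text/xml to
# http://localhost:8983/solr/<core>/update?commit=true
# (the core name in the path is the piece missing from the failing URL above).
```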
Re: Nutch 1.9 integration with Solr 5.0.0
Posted by yeshwanth kumar <ye...@gmail.com>.
Which db are you using?
If you have any time constraints for your task, write a MapReduce job which
reads from the DB and then indexes into Solr.
I use the crawl script for fetching, parsing, storing and indexing all at
once instead of doing them individually with the nutch script.
The error clearly points at the URL, so do a curl or write a simple Java
program using HTTP GET/POST to read and write documents. This will give you
some understanding of whether the URL is correct or not.
Sent from mobile, please excuse any typographical errors.
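[Editor's note: the curl-style sanity check suggested above can also be done with a few lines of stdlib Python. A sketch; the ping path and core name in the usage comment are assumptions to adapt to your install:]

```python
from urllib.request import urlopen
from urllib.error import URLError, HTTPError

def probe(url, timeout=5):
    """Return the HTTP status for a GET on `url`, or None if unreachable.

    A 404 usually means the core name is missing or wrong in the path;
    a connection failure means nothing is listening at that address.
    """
    try:
        with urlopen(url, timeout=timeout) as resp:
            return resp.getcode()
    except HTTPError as e:
        return e.code   # server answered, e.g. 404 for a bad core path
    except URLError:
        return None     # connection refused / DNS failure / timeout

# Example check against a hypothetical core named 'foo':
# probe("http://localhost:8983/solr/foo/admin/ping")  # expect 200 if healthy
```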
>>>>>>>>>
>>>>>>>>> rules
>>>>>>>>>
>>>>>>>> for scope 'indexer', using default
>>>>>>>>
>>>>>>>> 2015-04-07 09:38:10,734 INFO regex.RegexURLNormalizer - can't find
>>>>>>>>>
>>>>>>>>> rules
>>>>>>>>>
>>>>>>>> for scope 'indexer', using default
>>>>>>>>
>>>>>>>> 2015-04-07 09:38:10,895 INFO regex.RegexURLNormalizer - can't find
>>>>>>>>>
>>>>>>>>> rules
>>>>>>>>>
>>>>>>>> for scope 'indexer', using default
>>>>>>>>
>>>>>>>> 2015-04-07 09:38:11,088 INFO regex.RegexURLNormalizer - can't find
>>>>>>>>>
>>>>>>>>> rules
>>>>>>>>>
>>>>>>>> for scope 'indexer', using default
>>>>>>>>
>>>>>>>> 2015-04-07 09:38:11,219 INFO indexer.IndexWriters - Adding
>>>>>>>>> org.apache.nutch.indexwriter.solr.SolrIndexWriter
>>>>>>>>> 2015-04-07 09:38:11,237 INFO solr.SolrMappingReader - source:
>>>>>>>>> content
>>>>>>>>> dest: content
>>>>>>>>> 2015-04-07 09:38:11,237 INFO solr.SolrMappingReader - source:
>>>>>>>>> title
>>>>>>>>>
>>>>>>>>> dest:
>>>>>>>>>
>>>>>>>> title
>>>>>>>>
>>>>>>>>> 2015-04-07 09:38:11,237 INFO solr.SolrMappingReader - source: host
>>>>>>>>>
>>>>>>>>> dest:
>>>>>>>>>
>>>>>>>> host
>>>>>>>>
>>>>>>>> 2015-04-07 09:38:11,237 INFO solr.SolrMappingReader - source:
>>>>>>>>> segment
>>>>>>>>> dest: segment
>>>>>>>>> 2015-04-07 09:38:11,237 INFO solr.SolrMappingReader - source:
>>>>>>>>> boost
>>>>>>>>>
>>>>>>>>> dest:
>>>>>>>>>
>>>>>>>> boost
>>>>>>>>
>>>>>>>>> 2015-04-07 09:38:11,237 INFO solr.SolrMappingReader - source:
>>>>>>>>> digest
>>>>>>>>>
>>>>>>>>> dest:
>>>>>>>>>
>>>>>>>> digest
>>>>>>>>
>>>>>>>>> 2015-04-07 09:38:11,237 INFO solr.SolrMappingReader - source:
>>>>>>>>> tstamp
>>>>>>>>>
>>>>>>>>> dest:
>>>>>>>>>
>>>>>>>> tstamp
>>>>>>>>
>>>>>>>>> 2015-04-07 09:38:11,526 INFO solr.SolrIndexWriter - Indexing 250
>>>>>>>>>
>>>>>>>>> documents
>>>>>>>>>
>>>>>>>> 2015-04-07 09:38:11,526 INFO solr.SolrIndexWriter - Deleting 0
>>>>>>>>
>>>>>>>>> documents
>>>>>>>>>
>>>>>>>> 2015-04-07 09:38:11,644 INFO solr.SolrIndexWriter - Indexing 250
>>>>>>>> documents
>>>>>>>>
>>>>>>>> *2015-04-07 09:38:11,699 WARN mapred.LocalJobRunner -
>>>>>>>>
>>>>>>>>> job_local1245074757_0001*
>>>>>>>>> *org.apache.solr.common.SolrException: Not Found*
>>>>>>>>>
>>>>>>>>> *Not Found*
>>>>>>>>>
>>>>>>>>> *request: http://localhost:8983/solr/update?wt=javabin&version=2
>>>>>>>>> <http://localhost:8983/solr/update?wt=javabin&version=2>*
>>>>>>>>> * at
>>>>>>>>> org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.
>>>>>>>>>
>>>>>>>>> request(CommonsHttpSolrServer.java:430)*
>>>>>>>>>
>>>>>>>> * at
>>>>>>>>
>>>>>>>>> org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.
>>>>>>>>>
>>>>>>>>> request(CommonsHttpSolrServer.java:244)*
>>>>>>>>>
>>>>>>>> * at
>>>>>>>>
>>>>>>>>> org.apache.solr.client.solrj.request.AbstractUpdateRequest.
>>>>>>>>>
>>>>>>>>> process(AbstractUpdateRequest.java:105)*
>>>>>>>>>
>>>>>>>> * at
>>>>>>>>
>>>>>>>>> org.apache.nutch.indexwriter.solr.SolrIndexWriter.write(
>>>>>>>>>
>>>>>>>>> SolrIndexWriter.java:135)*
>>>>>>>>>
>>>>>>>> * at org.apache.nutch.indexer.IndexWriters.write(
>>>>>>>>
>>>>>>>>> IndexWriters.java:88)*
>>>>>>>>> * at
>>>>>>>>> org.apache.nutch.indexer.IndexerOutputFormat$1.write(
>>>>>>>>>
>>>>>>>>> IndexerOutputFormat.java:50)*
>>>>>>>>>
>>>>>>>> * at
>>>>>>>>
>>>>>>>>> org.apache.nutch.indexer.IndexerOutputFormat$1.write(
>>>>>>>>>
>>>>>>>>> IndexerOutputFormat.java:41)*
>>>>>>>>>
>>>>>>>> * at
>>>>>>>>
>>>>>>>>> org.apache.hadoop.mapred.ReduceTask$OldTrackingRecordWriter.write(
>>>>>>>>>
>>>>>>>>> ReduceTask.java:458)*
>>>>>>>>>
>>>>>>>> * at
>>>>>>>>
>>>>>>>>> org.apache.hadoop.mapred.ReduceTask$3.collect(
>>>>>>>>> ReduceTask.java:500)*
>>>>>>>>>
>>>>>>>> * at
>>>>>>>>
>>>>>>>> org.apache.nutch.indexer.IndexerMapReduce.reduce(
>>>>>>>>>
>>>>>>>>> IndexerMapReduce.java:323)*
>>>>>>>>>
>>>>>>>> * at
>>>>>>>>
>>>>>>>>> org.apache.nutch.indexer.IndexerMapReduce.reduce(
>>>>>>>>>
>>>>>>>>> IndexerMapReduce.java:53)*
>>>>>>>>>
>>>>>>>> * at org.apache.hadoop.mapred.ReduceTask.runOldReducer(
>>>>>>>>
>>>>>>>>> ReduceTask.java:522)*
>>>>>>>>>
>>>>>>>> * at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.
>>>>>>>> java:421)*
>>>>>>>>
>>>>>>>>> * at
>>>>>>>>> org.apache.hadoop.mapred.LocalJobRunner$Job.run(
>>>>>>>>>
>>>>>>>>> LocalJobRunner.java:398)*
>>>>>>>>>
>>>>>>>> *2015-04-07 09:38:12,408 ERROR indexer.IndexingJob - Indexer:
>>>>>>>>
>>>>>>>>> java.io.IOException: Job failed!*
>>>>>>>>> * at org.apache.hadoop.mapred.JobClient.runJob(JobClient.
>>>>>>>>> java:1357)*
>>>>>>>>> * at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.
>>>>>>>>> java:114)*
>>>>>>>>> * at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.
>>>>>>>>> java:176)*
>>>>>>>>> * at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)*
>>>>>>>>> * at org.apache.nutch.indexer.IndexingJob.main(IndexingJob.
>>>>>>>>> java:186)*
>>>>>>>>>
>>>>>>>>> On Tue, 7 Apr 2015 at 03:18 Jeff Cocking <jeff.cocking@gmail.com
>>>>>>>>> >
>>>>>>>>>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>> With Solr5.0.0 you can skip that step. Solr will auto create your
>>>>>>>>> schema
>>>>>>>>> document based on the data being provided.
>>>>>>>>>
>>>>>>>>> One of the new features with Solr5 is the install/service
>>>>>>>>>> feature. I
>>>>>>>>>>
>>>>>>>>>> did a
>>>>>>>>>>
>>>>>>>>> quick write up on how to install Solr5 on Centos. Might be
>>>>>>>>> something
>>>>>>>>>
>>>>>>>>> useful there for you.
>>>>>>>>>>
>>>>>>>>>> http://www.cocking.com/apache-solr-5-0-install-on-centos-7/
>>>>>>>>>>
>>>>>>>>>> jeff
>>>>>>>>>>
>>>>>>>>>> On Mon, Apr 6, 2015 at 3:13 PM, Anchit Jain <
>>>>>>>>>> anchitjain1234@gmail.com
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>> I want to index nutch results using *Solr 5.0* but as
>>>>>>>>>> mentioned in
>>>>>>>>>>
>>>>>>>>>> https://wiki.apache.org/nutch/NutchTutorial there is no
>>>>>>>>>>> directory
>>>>>>>>>>> ${APACHE_SOLR_HOME}/example/solr/collection1/conf/
>>>>>>>>>>> in solr 5.0 . So where I have to copy *schema.xml*?
>>>>>>>>>>> Also there is no *start.jar* present in example directory.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>
Re: Nutch 1.9 integration with Solr 5.0.0
Posted by Anchit Jain <an...@gmail.com>.
Same error :-(.
So is there no workaround for the error?
On Tuesday 07 April 2015 10:06 PM, Jeff Cocking wrote:
> I use the following for all my indexing work.
>
> Usage: bin/crawl <seedDir> <crawlDir> <solrURL> <numberOfRounds>
> Example: bin/crawl urls/ TestCrawl/ http://localhost:8983/solr/ 2
>
>
> On Tue, Apr 7, 2015 at 11:20 AM, Anchit Jain <an...@gmail.com>
> wrote:
>
>> Yes, it is working correctly from the browser. I can also manually add
>> documents from the web browser, but not through Nutch.
>> I am not able to figure out where the problem is.
>>
>> Is there any manual way of adding the crawldb and linkdb to Nutch besides
>> that command?
>>
>>
>> On Tuesday 07 April 2015 09:47 PM, Jeff Cocking wrote:
>>
>>> There can be numerous reasons: hosts.conf, firewall, etc. These are all
>>> unique to your system.
>>>
>>> Have you viewed the solr admin panel via a browser? This is a critical
>>> step in the installation. This validates SOLR can accept HTTP commands.
>>>
>>> On Tue, Apr 7, 2015 at 9:53 AM, Anchit Jain <an...@gmail.com>
>>> wrote:
>>>
>>> I created a new core named *foo*. Then I copied the *schema.xml* from
>>>> *nutch* into *var/solr/data/foo/conf* with the changes described in
>>>> *https://wiki.apache.org/nutch/NutchTutorial*.
>>>> I changed the URL to *http://localhost:8983/solr/#/foo*,
>>>> so the new command is
>>>> "*bin/nutch solrindex http://localhost:8983/solr/#/foo crawl/crawldb/
>>>> -linkdb crawl/linkdb/ crawl/segments/20150406231502/ -filter -normalize*"
>>>> But now I am getting the error
>>>> *org.apache.solr.common.SolrException: HTTP method POST is not supported
>>>> by this URL*
>>>>
>>>> Is some other change also required in the URL to support POST requests?
>>>>
>>>> Full log
>>>>
>>>> 2015-04-07 20:10:56,068 INFO indexer.IndexingJob - Indexer: starting at 2015-04-07 20:10:56
>>>> 2015-04-07 20:10:56,178 INFO indexer.IndexingJob - Indexer: deleting gone documents: false
>>>> 2015-04-07 20:10:56,178 INFO indexer.IndexingJob - Indexer: URL filtering: true
>>>> 2015-04-07 20:10:56,178 INFO indexer.IndexingJob - Indexer: URL normalizing: true
>>>> 2015-04-07 20:10:56,727 INFO indexer.IndexWriters - Adding org.apache.nutch.indexwriter.solr.SolrIndexWriter
>>>> 2015-04-07 20:10:56,727 INFO indexer.IndexingJob - Active IndexWriters :
>>>> SOLRIndexWriter
>>>> solr.server.url : URL of the SOLR instance (mandatory)
>>>> solr.commit.size : buffer size when sending to SOLR (default 1000)
>>>> solr.mapping.file : name of the mapping file for fields (default solrindex-mapping.xml)
>>>> solr.auth : use authentication (default false)
>>>> solr.auth.username : use authentication (default false)
>>>> solr.auth : username for authentication
>>>> solr.auth.password : password for authentication
>>>>
>>>> 2015-04-07 20:10:56,772 INFO indexer.IndexerMapReduce - IndexerMapReduce: crawldb: crawl/crawldb
>>>> 2015-04-07 20:10:56,772 INFO indexer.IndexerMapReduce - IndexerMapReduce: linkdb: crawl/linkdb
>>>> 2015-04-07 20:10:56,772 INFO indexer.IndexerMapReduce - IndexerMapReduces: adding segment: crawl/segments/20150406231502
>>>> 2015-04-07 20:10:57,205 WARN util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
>>>> 2015-04-07 20:10:58,020 INFO anchor.AnchorIndexingFilter - Anchor deduplication is: off
>>>> 2015-04-07 20:10:58,134 INFO regex.RegexURLNormalizer - can't find rules for scope 'indexer', using default
>>>> 2015-04-07 20:11:00,114 INFO regex.RegexURLNormalizer - can't find rules for scope 'indexer', using default
>>>> 2015-04-07 20:11:01,205 INFO regex.RegexURLNormalizer - can't find rules for scope 'indexer', using default
>>>> 2015-04-07 20:11:01,344 INFO regex.RegexURLNormalizer - can't find rules for scope 'indexer', using default
>>>> 2015-04-07 20:11:01,577 INFO regex.RegexURLNormalizer - can't find rules for scope 'indexer', using default
>>>> 2015-04-07 20:11:01,788 INFO regex.RegexURLNormalizer - can't find rules for scope 'indexer', using default
>>>> 2015-04-07 20:11:01,921 INFO indexer.IndexWriters - Adding org.apache.nutch.indexwriter.solr.SolrIndexWriter
>>>> 2015-04-07 20:11:01,986 INFO solr.SolrMappingReader - source: content dest: content
>>>> 2015-04-07 20:11:01,986 INFO solr.SolrMappingReader - source: title dest: title
>>>> 2015-04-07 20:11:01,986 INFO solr.SolrMappingReader - source: host dest: host
>>>> 2015-04-07 20:11:01,986 INFO solr.SolrMappingReader - source: segment dest: segment
>>>> 2015-04-07 20:11:01,987 INFO solr.SolrMappingReader - source: boost dest: boost
>>>> 2015-04-07 20:11:01,987 INFO solr.SolrMappingReader - source: digest dest: digest
>>>> 2015-04-07 20:11:01,987 INFO solr.SolrMappingReader - source: tstamp dest: tstamp
>>>> 2015-04-07 20:11:02,266 INFO solr.SolrIndexWriter - Indexing 250 documents
>>>> 2015-04-07 20:11:02,267 INFO solr.SolrIndexWriter - Deleting 0 documents
>>>> 2015-04-07 20:11:02,512 INFO solr.SolrIndexWriter - Indexing 250 documents
>>>> *2015-04-07 20:11:02,576 WARN mapred.LocalJobRunner - job_local1831338118_0001*
>>>> *org.apache.solr.common.SolrException: HTTP method POST is not supported by this URL*
>>>>
>>>> *HTTP method POST is not supported by this URL*
>>>>
>>>> request: http://localhost:8983/solr/
>>>> at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:430)
>>>> at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:244)
>>>> at org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:105)
>>>> at org.apache.nutch.indexwriter.solr.SolrIndexWriter.write(SolrIndexWriter.java:135)
>>>> at org.apache.nutch.indexer.IndexWriters.write(IndexWriters.java:88)
>>>> at org.apache.nutch.indexer.IndexerOutputFormat$1.write(IndexerOutputFormat.java:50)
>>>> at org.apache.nutch.indexer.IndexerOutputFormat$1.write(IndexerOutputFormat.java:41)
>>>> at org.apache.hadoop.mapred.ReduceTask$OldTrackingRecordWriter.write(ReduceTask.java:458)
>>>> at org.apache.hadoop.mapred.ReduceTask$3.collect(ReduceTask.java:500)
>>>> at org.apache.nutch.indexer.IndexerMapReduce.reduce(IndexerMapReduce.java:323)
>>>> at org.apache.nutch.indexer.IndexerMapReduce.reduce(IndexerMapReduce.java:53)
>>>> at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:522)
>>>> at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:421)
>>>> at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:398)
>>>> 2015-04-07 20:11:02,724 ERROR indexer.IndexingJob - Indexer: java.io.IOException: Job failed!
>>>> at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1357)
>>>> at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:114)
>>>> at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:176)
>>>> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>>>> at org.apache.nutch.indexer.IndexingJob.main(IndexingJob.java:186)
>>>>
>>>>
>>>> On Tuesday 07 April 2015 06:53 PM, Jeff Cocking wrote:
>>>>
>>>>> The command you are using is not pointing to the specific solr index you
>>>>> created. The http://localhost:8983/solr needs to be changed to the URL
>>>>> for the core created. It should look like
>>>>> http://localhost:8983/solr/#/new_core_name.
>>>>>
>>>>> On Tue, Apr 7, 2015 at 2:33 AM, Anchit Jain <an...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> I followed the instructions given on your blog, created a new core for
>>>>>> nutch data, and copied the schema.xml of nutch into it.
>>>>>> Then I ran the following command in the nutch working directory:
>>>>>> bin/nutch solrindex http://localhost:8983/solr crawl/crawldb/ -linkdb
>>>>>> crawl/linkdb/ crawl/segments/20150406231502/ -filter -normalize
>>>>>>
>>>>>> But the same error comes up, just like in the previous runs.
>>>>>>
>>>>>> On Tue, 7 Apr 2015 at 10:00 Jeff Cocking <je...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Solr5 is multicore by default. You have not finished the install by
>>>>>>> setting up solr5's core. I would suggest you look at the link I sent
>>>>>>> to finish up your setup.
>>>>>>>
>>>>>>> After you finish your install your solr URL will be
>>>>>>> http://localhost:8983/solr/#/core_name.
>>>>>>>
>>>>>>> Jeff Cocking
>>>>>>>
>>>>>>> I apologize for my brevity.
>>>>>>> This was sent from my mobile device while I should be focusing on
>>>>>>> something else.....
>>>>>>> Like a meeting, driving, family, etc.
>>>>>>>
>>>>>>> On Apr 6, 2015, at 11:16 PM, Anchit Jain <an...@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> I have already installed Solr. I want to integrate it with nutch.
>>>>>>>> Whenever I try to issue this command to nutch
>>>>>>>> "bin/nutch solrindex http://localhost:8983/solr crawl/crawldb/
>>>>>>>> -linkdb crawl/linkdb/ crawl/segments/20150406231502/ -filter
>>>>>>>> -normalize"
>>>>>>>> I always get an error
>>>>>>>>
>>>>>>>> Indexer: java.io.IOException: Job failed!
>>>>>>>> at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1357)
>>>>>>>> at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:114)
>>>>>>>> at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:176)
>>>>>>>> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>>>>>>>> at org.apache.nutch.indexer.IndexingJob.main(IndexingJob.java:186)
>>>>>>>>
>>>>>>>> Here is the complete hadoop log for the process. I have underlined
>>>>>>>> the error part in it.
>>>>>>>> 2015-04-07 09:38:06,613 INFO indexer.IndexingJob - Indexer: starting at 2015-04-07 09:38:06
>>>>>>>> 2015-04-07 09:38:06,684 INFO indexer.IndexingJob - Indexer: deleting gone documents: false
>>>>>>>> 2015-04-07 09:38:06,685 INFO indexer.IndexingJob - Indexer: URL filtering: true
>>>>>>>> 2015-04-07 09:38:06,685 INFO indexer.IndexingJob - Indexer: URL normalizing: true
>>>>>>>> 2015-04-07 09:38:06,893 INFO indexer.IndexWriters - Adding org.apache.nutch.indexwriter.solr.SolrIndexWriter
>>>>>>>> 2015-04-07 09:38:06,893 INFO indexer.IndexingJob - Active IndexWriters :
>>>>>>>> SOLRIndexWriter
>>>>>>>> solr.server.url : URL of the SOLR instance (mandatory)
>>>>>>>> solr.commit.size : buffer size when sending to SOLR (default 1000)
>>>>>>>> solr.mapping.file : name of the mapping file for fields (default solrindex-mapping.xml)
>>>>>>>> solr.auth : use authentication (default false)
>>>>>>>> solr.auth.username : use authentication (default false)
>>>>>>>> solr.auth : username for authentication
>>>>>>>> solr.auth.password : password for authentication
>>>>>>>>
>>>>>>>> 2015-04-07 09:38:06,898 INFO indexer.IndexerMapReduce - IndexerMapReduce: crawldb: crawl/crawldb
>>>>>>>> 2015-04-07 09:38:06,898 INFO indexer.IndexerMapReduce - IndexerMapReduce: linkdb: crawl/linkdb
>>>>>>>> 2015-04-07 09:38:06,898 INFO indexer.IndexerMapReduce - IndexerMapReduces: adding segment: crawl/segments/20150406231502
>>>>>>>> 2015-04-07 09:38:07,036 WARN util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
>>>>>>>> 2015-04-07 09:38:07,540 INFO anchor.AnchorIndexingFilter - Anchor deduplication is: off
>>>>>>>> 2015-04-07 09:38:07,565 INFO regex.RegexURLNormalizer - can't find rules for scope 'indexer', using default
>>>>>>>> 2015-04-07 09:38:09,552 INFO regex.RegexURLNormalizer - can't find rules for scope 'indexer', using default
>>>>>>>> 2015-04-07 09:38:10,642 INFO regex.RegexURLNormalizer - can't find rules for scope 'indexer', using default
>>>>>>>> 2015-04-07 09:38:10,734 INFO regex.RegexURLNormalizer - can't find rules for scope 'indexer', using default
>>>>>>>> 2015-04-07 09:38:10,895 INFO regex.RegexURLNormalizer - can't find rules for scope 'indexer', using default
>>>>>>>> 2015-04-07 09:38:11,088 INFO regex.RegexURLNormalizer - can't find rules for scope 'indexer', using default
>>>>>>>> 2015-04-07 09:38:11,219 INFO indexer.IndexWriters - Adding org.apache.nutch.indexwriter.solr.SolrIndexWriter
>>>>>>>> 2015-04-07 09:38:11,237 INFO solr.SolrMappingReader - source: content dest: content
>>>>>>>> 2015-04-07 09:38:11,237 INFO solr.SolrMappingReader - source: title dest: title
>>>>>>>> 2015-04-07 09:38:11,237 INFO solr.SolrMappingReader - source: host dest: host
>>>>>>>> 2015-04-07 09:38:11,237 INFO solr.SolrMappingReader - source: segment dest: segment
>>>>>>>> 2015-04-07 09:38:11,237 INFO solr.SolrMappingReader - source: boost dest: boost
>>>>>>>> 2015-04-07 09:38:11,237 INFO solr.SolrMappingReader - source: digest dest: digest
>>>>>>>> 2015-04-07 09:38:11,237 INFO solr.SolrMappingReader - source: tstamp dest: tstamp
>>>>>>>> 2015-04-07 09:38:11,526 INFO solr.SolrIndexWriter - Indexing 250 documents
>>>>>>>> 2015-04-07 09:38:11,526 INFO solr.SolrIndexWriter - Deleting 0 documents
>>>>>>>> 2015-04-07 09:38:11,644 INFO solr.SolrIndexWriter - Indexing 250 documents
>>>>>>>> *2015-04-07 09:38:11,699 WARN mapred.LocalJobRunner - job_local1245074757_0001*
>>>>>>>> *org.apache.solr.common.SolrException: Not Found*
>>>>>>>>
>>>>>>>> *Not Found*
>>>>>>>>
>>>>>>>> *request: http://localhost:8983/solr/update?wt=javabin&version=2*
>>>>>>>> * at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:430)*
>>>>>>>> * at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:244)*
>>>>>>>> * at org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:105)*
>>>>>>>> * at org.apache.nutch.indexwriter.solr.SolrIndexWriter.write(SolrIndexWriter.java:135)*
>>>>>>>> * at org.apache.nutch.indexer.IndexWriters.write(IndexWriters.java:88)*
>>>>>>>> * at org.apache.nutch.indexer.IndexerOutputFormat$1.write(IndexerOutputFormat.java:50)*
>>>>>>>> * at org.apache.nutch.indexer.IndexerOutputFormat$1.write(IndexerOutputFormat.java:41)*
>>>>>>>> * at org.apache.hadoop.mapred.ReduceTask$OldTrackingRecordWriter.write(ReduceTask.java:458)*
>>>>>>>> * at org.apache.hadoop.mapred.ReduceTask$3.collect(ReduceTask.java:500)*
>>>>>>>> * at org.apache.nutch.indexer.IndexerMapReduce.reduce(IndexerMapReduce.java:323)*
>>>>>>>> * at org.apache.nutch.indexer.IndexerMapReduce.reduce(IndexerMapReduce.java:53)*
>>>>>>>> * at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:522)*
>>>>>>>> * at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:421)*
>>>>>>>> * at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:398)*
>>>>>>>> *2015-04-07 09:38:12,408 ERROR indexer.IndexingJob - Indexer: java.io.IOException: Job failed!*
>>>>>>>> * at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1357)*
>>>>>>>> * at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:114)*
>>>>>>>> * at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:176)*
>>>>>>>> * at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)*
>>>>>>>> * at org.apache.nutch.indexer.IndexingJob.main(IndexingJob.java:186)*
>>>>>>>>
>>>>>>>> On Tue, 7 Apr 2015 at 03:18 Jeff Cocking <je...@gmail.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> With Solr5.0.0 you can skip that step. Solr will auto create your
>>>>>>>>> schema document based on the data being provided.
>>>>>>>>>
>>>>>>>>> One of the new features with Solr5 is the install/service feature.
>>>>>>>>> I did a quick write up on how to install Solr5 on Centos. Might be
>>>>>>>>> something useful there for you.
>>>>>>>>>
>>>>>>>>> http://www.cocking.com/apache-solr-5-0-install-on-centos-7/
>>>>>>>>>
>>>>>>>>> jeff
>>>>>>>>>
>>>>>>>>> On Mon, Apr 6, 2015 at 3:13 PM, Anchit Jain <
>>>>>>>>> anchitjain1234@gmail.com>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> I want to index nutch results using *Solr 5.0* but, as mentioned
>>>>>>>>>> in https://wiki.apache.org/nutch/NutchTutorial, there is no
>>>>>>>>>> directory ${APACHE_SOLR_HOME}/example/solr/collection1/conf/
>>>>>>>>>> in solr 5.0. So where do I have to copy *schema.xml*?
>>>>>>>>>> Also there is no *start.jar* present in the example directory.
>>>>>>>>>>
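Both failures quoted in this thread ("Not Found" and "HTTP method POST is not supported by this URL") are plain HTTP responses from Solr, so they can be reproduced outside Nutch, in the spirit of the earlier suggestion to test the URL with curl or a simple GET/POST client. Below is one way to sketch such a probe; the core name `foo` comes from the messages above and the endpoints shown are only examples, not a confirmed layout for every install:

```python
# Probe a Solr update endpoint with the same kind of POST the Nutch
# indexer sends, and report the HTTP status instead of a Hadoop job failure.
from urllib import request, error

def probe_update_endpoint(url: str) -> str:
    """POST a harmless no-op update body and describe the outcome."""
    req = request.Request(url, data=b"<commit/>",
                          headers={"Content-Type": "text/xml"})
    try:
        with request.urlopen(req, timeout=5) as resp:
            return "HTTP %d" % resp.status    # 200: endpoint accepts updates
    except error.HTTPError as e:
        return "HTTP %d" % e.code             # 404: wrong core/handler path
    except OSError as e:
        return "connection failed: %s" % e    # Solr not running / wrong port

# The thread's failing URL versus a per-core update handler (core "foo"):
for url in ("http://localhost:8983/solr/update",
            "http://localhost:8983/solr/foo/update"):
    print(url, "->", probe_update_endpoint(url))
```

With no Solr running this prints `connection failed`; a 404 on the first URL but a 200 on the second would confirm the problem is the URL, not Nutch.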
Re: Nutch 1.9 integration with Solr 5.0.0
Posted by Jeff Cocking <je...@gmail.com>.
I use the following for all my indexing work.
Usage: bin/crawl <seedDir> <crawlDir> <solrURL> <numberOfRounds>
Example: bin/crawl urls/ TestCrawl/ http://localhost:8983/solr/ 2
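One recurring snag in the thread is the difference between the admin-UI address (with `/#/core`) and the address an indexer should POST to: the `#` fragment is interpreted by the browser and is never sent to the server. A small helper illustrating the distinction (a sketch; `foo` is the hypothetical core name used earlier in the thread):

```python
from urllib.parse import urlparse

def core_update_url(url: str) -> str:
    """Derive a per-core update endpoint from either an admin-UI URL
    (http://host:8983/solr/#/foo -- the '#' part is browser-only) or a
    plain core URL (http://host:8983/solr/foo)."""
    p = urlparse(url)
    # Prefer the fragment if present, otherwise the last path component.
    core = p.fragment.strip("/") or p.path.strip("/").split("/")[-1]
    return "%s://%s/solr/%s/update" % (p.scheme, p.netloc, core)

print(core_update_url("http://localhost:8983/solr/#/foo"))  # admin-UI form
print(core_update_url("http://localhost:8983/solr/foo"))    # plain core form
# both print: http://localhost:8983/solr/foo/update
```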
>>>>>> On Apr 6, 2015, at 11:16 PM, Anchit Jain <an...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>> I have already installed Solr.I want to integrate it with nutch.
>>>>>>> Whenever I try to issue this command to nutch
>>>>>>> ""bin/nutch solrindex http://localhost:8983/solr crawl/crawldb/
>>>>>>>
>>>>>>> -linkdb
>>>>>> crawl/linkdb/ crawl/segments/20150406231502/ -filter -normalize"
>>>>>>
>>>>>>> I always get a error
>>>>>>>
>>>>>>> Indexer: java.io.IOException: Job failed!
>>>>>>> at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1357)
>>>>>>> at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:114)
>>>>>>> at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:176)
>>>>>>> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>>>>>>> at org.apache.nutch.indexer.IndexingJob.main(IndexingJob.java:186)
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Here is the complete hadoop log for the process.I have underlined the
>>>>>>>
>>>>>>> error
>>>>>>
>>>>>> part in it.
>>>>>>>
>>>>>>> 2015-04-07 09:38:06,613 INFO indexer.IndexingJob - Indexer: starting
>>>>>>>
>>>>>>> at
>>>>>> 2015-04-07 09:38:06
>>>>>>
>>>>>>> 2015-04-07 09:38:06,684 INFO indexer.IndexingJob - Indexer: deleting
>>>>>>>
>>>>>>> gone
>>>>>>
>>>>>> documents: false
>>>>>>> 2015-04-07 09:38:06,685 INFO indexer.IndexingJob - Indexer: URL
>>>>>>>
>>>>>>> filtering:
>>>>>>
>>>>>> true
>>>>>>> 2015-04-07 09:38:06,685 INFO indexer.IndexingJob - Indexer: URL
>>>>>>> normalizing: true
>>>>>>> 2015-04-07 09:38:06,893 INFO indexer.IndexWriters - Adding
>>>>>>> org.apache.nutch.indexwriter.solr.SolrIndexWriter
>>>>>>> 2015-04-07 09:38:06,893 INFO indexer.IndexingJob - Active
>>>>>>>
>>>>>>> IndexWriters :
>>>>>> SOLRIndexWriter
>>>>>>
>>>>>>> solr.server.url : URL of the SOLR instance (mandatory)
>>>>>>> solr.commit.size : buffer size when sending to SOLR (default 1000)
>>>>>>> solr.mapping.file : name of the mapping file for fields (default
>>>>>>> solrindex-mapping.xml)
>>>>>>> solr.auth : use authentication (default false)
>>>>>>> solr.auth.username : use authentication (default false)
>>>>>>> solr.auth : username for authentication
>>>>>>> solr.auth.password : password for authentication
>>>>>>>
>>>>>>>
>>>>>>> 2015-04-07 09:38:06,898 INFO indexer.IndexerMapReduce -
>>>>>>>
>>>>>>> IndexerMapReduce:
>>>>>>
>>>>>> crawldb: crawl/crawldb
>>>>>>> 2015-04-07 09:38:06,898 INFO indexer.IndexerMapReduce -
>>>>>>>
>>>>>>> IndexerMapReduce:
>>>>>>
>>>>>> linkdb: crawl/linkdb
>>>>>>> 2015-04-07 09:38:06,898 INFO indexer.IndexerMapReduce -
>>>>>>>
>>>>>>> IndexerMapReduces:
>>>>>>
>>>>>> adding segment: crawl/segments/20150406231502
>>>>>>> 2015-04-07 09:38:07,036 WARN util.NativeCodeLoader - Unable to load
>>>>>>> native-hadoop library for your platform... using builtin-java classes
>>>>>>>
>>>>>>> where
>>>>>>
>>>>>> applicable
>>>>>>> 2015-04-07 09:38:07,540 INFO anchor.AnchorIndexingFilter - Anchor
>>>>>>> deduplication is: off
>>>>>>> 2015-04-07 09:38:07,565 INFO regex.RegexURLNormalizer - can't find
>>>>>>>
>>>>>>> rules
>>>>>> for scope 'indexer', using default
>>>>>>
>>>>>>> 2015-04-07 09:38:09,552 INFO regex.RegexURLNormalizer - can't find
>>>>>>>
>>>>>>> rules
>>>>>> for scope 'indexer', using default
>>>>>>
>>>>>>> 2015-04-07 09:38:10,642 INFO regex.RegexURLNormalizer - can't find
>>>>>>>
>>>>>>> rules
>>>>>> for scope 'indexer', using default
>>>>>>
>>>>>>> 2015-04-07 09:38:10,734 INFO regex.RegexURLNormalizer - can't find
>>>>>>>
>>>>>>> rules
>>>>>> for scope 'indexer', using default
>>>>>>
>>>>>>> 2015-04-07 09:38:10,895 INFO regex.RegexURLNormalizer - can't find
>>>>>>>
>>>>>>> rules
>>>>>> for scope 'indexer', using default
>>>>>>
>>>>>>> 2015-04-07 09:38:11,088 INFO regex.RegexURLNormalizer - can't find
>>>>>>>
>>>>>>> rules
>>>>>> for scope 'indexer', using default
>>>>>>
>>>>>>> 2015-04-07 09:38:11,219 INFO indexer.IndexWriters - Adding
>>>>>>> org.apache.nutch.indexwriter.solr.SolrIndexWriter
>>>>>>> 2015-04-07 09:38:11,237 INFO solr.SolrMappingReader - source:
>>>>>>> content
>>>>>>> dest: content
>>>>>>> 2015-04-07 09:38:11,237 INFO solr.SolrMappingReader - source: title
>>>>>>>
>>>>>>> dest:
>>>>>>
>>>>>> title
>>>>>>> 2015-04-07 09:38:11,237 INFO solr.SolrMappingReader - source: host
>>>>>>>
>>>>>>> dest:
>>>>>> host
>>>>>>
>>>>>>> 2015-04-07 09:38:11,237 INFO solr.SolrMappingReader - source:
>>>>>>> segment
>>>>>>> dest: segment
>>>>>>> 2015-04-07 09:38:11,237 INFO solr.SolrMappingReader - source: boost
>>>>>>>
>>>>>>> dest:
>>>>>>
>>>>>> boost
>>>>>>> 2015-04-07 09:38:11,237 INFO solr.SolrMappingReader - source: digest
>>>>>>>
>>>>>>> dest:
>>>>>>
>>>>>> digest
>>>>>>> 2015-04-07 09:38:11,237 INFO solr.SolrMappingReader - source: tstamp
>>>>>>>
>>>>>>> dest:
>>>>>>
>>>>>> tstamp
>>>>>>> 2015-04-07 09:38:11,526 INFO solr.SolrIndexWriter - Indexing 250
>>>>>>>
>>>>>>> documents
>>>>>>
>>>>>> 2015-04-07 09:38:11,526 INFO solr.SolrIndexWriter - Deleting 0
>>>>>>>
>>>>>>> documents
>>>>>> 2015-04-07 09:38:11,644 INFO solr.SolrIndexWriter - Indexing 250
>>>>>> documents
>>>>>>
>>>>>> *2015-04-07 09:38:11,699 WARN mapred.LocalJobRunner -
>>>>>>> job_local1245074757_0001*
>>>>>>> *org.apache.solr.common.SolrException: Not Found*
>>>>>>>
>>>>>>> *Not Found*
>>>>>>>
>>>>>>> *request: http://localhost:8983/solr/update?wt=javabin&version=2
>>>>>>> <http://localhost:8983/solr/update?wt=javabin&version=2>*
>>>>>>> * at
>>>>>>> org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.
>>>>>>>
>>>>>>> request(CommonsHttpSolrServer.java:430)*
>>>>>>
>>>>>> * at
>>>>>>> org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.
>>>>>>>
>>>>>>> request(CommonsHttpSolrServer.java:244)*
>>>>>>
>>>>>> * at
>>>>>>> org.apache.solr.client.solrj.request.AbstractUpdateRequest.
>>>>>>>
>>>>>>> process(AbstractUpdateRequest.java:105)*
>>>>>>
>>>>>> * at
>>>>>>> org.apache.nutch.indexwriter.solr.SolrIndexWriter.write(
>>>>>>>
>>>>>>> SolrIndexWriter.java:135)*
>>>>>>
>>>>>> * at org.apache.nutch.indexer.IndexWriters.write(
>>>>>>> IndexWriters.java:88)*
>>>>>>> * at
>>>>>>> org.apache.nutch.indexer.IndexerOutputFormat$1.write(
>>>>>>>
>>>>>>> IndexerOutputFormat.java:50)*
>>>>>>
>>>>>> * at
>>>>>>> org.apache.nutch.indexer.IndexerOutputFormat$1.write(
>>>>>>>
>>>>>>> IndexerOutputFormat.java:41)*
>>>>>>
>>>>>> * at
>>>>>>> org.apache.hadoop.mapred.ReduceTask$OldTrackingRecordWriter.write(
>>>>>>>
>>>>>>> ReduceTask.java:458)*
>>>>>>
>>>>>> * at
>>>>>>>
>>>>>>> org.apache.hadoop.mapred.ReduceTask$3.collect(ReduceTask.java:500)*
>>>>>> * at
>>>>>>
>>>>>>> org.apache.nutch.indexer.IndexerMapReduce.reduce(
>>>>>>>
>>>>>>> IndexerMapReduce.java:323)*
>>>>>>
>>>>>> * at
>>>>>>> org.apache.nutch.indexer.IndexerMapReduce.reduce(
>>>>>>>
>>>>>>> IndexerMapReduce.java:53)*
>>>>>>
>>>>>> * at org.apache.hadoop.mapred.ReduceTask.runOldReducer(
>>>>>>>
>>>>>>> ReduceTask.java:522)*
>>>>>>
>>>>>> * at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:421)*
>>>>>>> * at
>>>>>>> org.apache.hadoop.mapred.LocalJobRunner$Job.run(
>>>>>>>
>>>>>>> LocalJobRunner.java:398)*
>>>>>>
>>>>>> *2015-04-07 09:38:12,408 ERROR indexer.IndexingJob - Indexer:
>>>>>>> java.io.IOException: Job failed!*
>>>>>>> * at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1357)*
>>>>>>> * at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.
>>>>>>> java:114)*
>>>>>>> * at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:176)*
>>>>>>> * at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)*
>>>>>>> * at org.apache.nutch.indexer.IndexingJob.main(IndexingJob.
>>>>>>> java:186)*
>>>>>>>
>>>>>>> On Tue, 7 Apr 2015 at 03:18 Jeff Cocking <je...@gmail.com>
>>>>>>>>
>>>>>>>> wrote:
>>>>>>> With Solr5.0.0 you can skip that step. Solr will auto create your
>>>>>>> schema
>>>>>>> document based on the data being provided.
>>>>>>>
>>>>>>>> One of the new features with Solr5 is the install/service feature. I
>>>>>>>>
>>>>>>>> did a
>>>>>>> quick write up on how to install Solr5 on Centos. Might be something
>>>>>>>
>>>>>>>> useful there for you.
>>>>>>>>
>>>>>>>> http://www.cocking.com/apache-solr-5-0-install-on-centos-7/
>>>>>>>>
>>>>>>>> jeff
>>>>>>>>
>>>>>>>> On Mon, Apr 6, 2015 at 3:13 PM, Anchit Jain <
>>>>>>>> anchitjain1234@gmail.com
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>> I want to index nutch results using *Solr 5.0* but as mentioned in
>>>>>>>>
>>>>>>>>> https://wiki.apache.org/nutch/NutchTutorial there is no directory
>>>>>>>>> ${APACHE_SOLR_HOME}/example/solr/collection1/conf/
>>>>>>>>> in solr 5.0 . So where I have to copy *schema.xml*?
>>>>>>>>> Also there is no *start.jar* present in example directory.
>>>>>>>>>
>>>>>>>>>
>
Re: Nutch 1.9 integration with Solr 5.0.0
Posted by Anchit Jain <an...@gmail.com>.
Yes, it is working correctly from the browser. I can also manually add
documents from the web browser, but not through Nutch.
I am not able to figure out where the problem is.
Is there any manual way of adding the crawldb and linkdb to Nutch
besides that command?
On Tuesday 07 April 2015 09:47 PM, Jeff Cocking wrote:
> There can be numerous reasons....Hosts.conf, firewall, etc. These are all
> unique to your system.
>
> Have you viewed the solr admin panel via a browser? This is a critical
> step in the installation. This validates SOLR can accept HTTP commands.
>
> On Tue, Apr 7, 2015 at 9:53 AM, Anchit Jain <an...@gmail.com>
> wrote:
>
>> I created a new core named *foo*. Then I copied the *schema.xml* from
>> *nutch* into *var/solr/data/foo/conf* with changes as described in
>> *https://wiki.apache.org/nutch/NutchTutorial*.
>> I changed the URL to *http://localhost:8983/solr/#/foo*,
>> so the new command is
>> "*bin/nutch solrindex http://localhost:8983/solr/#/foo crawl/crawldb/
>> -linkdb crawl/linkdb/ crawl/segments/20150406231502/ -filter -normalize*"
>> But now I am getting the error
>> *org.apache.solr.common.SolrException: HTTP method POST is not supported
>> by this URL*
>> Is some other change also required in the URL to support POST requests?
>>
>> Full log
>>
>> 2015-04-07 20:10:56,068 INFO indexer.IndexingJob - Indexer: starting at
>> 2015-04-07 20:10:56
>> 2015-04-07 20:10:56,178 INFO indexer.IndexingJob - Indexer: deleting gone
>> documents: false
>> 2015-04-07 20:10:56,178 INFO indexer.IndexingJob - Indexer: URL
>> filtering: true
>> 2015-04-07 20:10:56,178 INFO indexer.IndexingJob - Indexer: URL
>> normalizing: true
>> 2015-04-07 20:10:56,727 INFO indexer.IndexWriters - Adding
>> org.apache.nutch.indexwriter.solr.SolrIndexWriter
>> 2015-04-07 20:10:56,727 INFO indexer.IndexingJob - Active IndexWriters :
>> SOLRIndexWriter
>> solr.server.url : URL of the SOLR instance (mandatory)
>> solr.commit.size : buffer size when sending to SOLR (default 1000)
>> solr.mapping.file : name of the mapping file for fields (default
>> solrindex-mapping.xml)
>> solr.auth : use authentication (default false)
>> solr.auth.username : use authentication (default false)
>> solr.auth : username for authentication
>> solr.auth.password : password for authentication
>>
>>
>> 2015-04-07 20:10:56,772 INFO indexer.IndexerMapReduce - IndexerMapReduce:
>> crawldb: crawl/crawldb
>> 2015-04-07 20:10:56,772 INFO indexer.IndexerMapReduce - IndexerMapReduce:
>> linkdb: crawl/linkdb
>> 2015-04-07 20:10:56,772 INFO indexer.IndexerMapReduce -
>> IndexerMapReduces: adding segment: crawl/segments/20150406231502
>> 2015-04-07 20:10:57,205 WARN util.NativeCodeLoader - Unable to load
>> native-hadoop library for your platform... using builtin-java classes where
>> applicable
>> 2015-04-07 20:10:58,020 INFO anchor.AnchorIndexingFilter - Anchor
>> deduplication is: off
>> 2015-04-07 20:10:58,134 INFO regex.RegexURLNormalizer - can't find rules
>> for scope 'indexer', using default
>> 2015-04-07 20:11:00,114 INFO regex.RegexURLNormalizer - can't find rules
>> for scope 'indexer', using default
>> 2015-04-07 20:11:01,205 INFO regex.RegexURLNormalizer - can't find rules
>> for scope 'indexer', using default
>> 2015-04-07 20:11:01,344 INFO regex.RegexURLNormalizer - can't find rules
>> for scope 'indexer', using default
>> 2015-04-07 20:11:01,577 INFO regex.RegexURLNormalizer - can't find rules
>> for scope 'indexer', using default
>> 2015-04-07 20:11:01,788 INFO regex.RegexURLNormalizer - can't find rules
>> for scope 'indexer', using default
>> 2015-04-07 20:11:01,921 INFO indexer.IndexWriters - Adding
>> org.apache.nutch.indexwriter.solr.SolrIndexWriter
>> 2015-04-07 20:11:01,986 INFO solr.SolrMappingReader - source: content
>> dest: content
>> 2015-04-07 20:11:01,986 INFO solr.SolrMappingReader - source: title dest:
>> title
>> 2015-04-07 20:11:01,986 INFO solr.SolrMappingReader - source: host dest:
>> host
>> 2015-04-07 20:11:01,986 INFO solr.SolrMappingReader - source: segment
>> dest: segment
>> 2015-04-07 20:11:01,987 INFO solr.SolrMappingReader - source: boost dest:
>> boost
>> 2015-04-07 20:11:01,987 INFO solr.SolrMappingReader - source: digest
>> dest: digest
>> 2015-04-07 20:11:01,987 INFO solr.SolrMappingReader - source: tstamp
>> dest: tstamp
>> 2015-04-07 20:11:02,266 INFO solr.SolrIndexWriter - Indexing 250 documents
>> 2015-04-07 20:11:02,267 INFO solr.SolrIndexWriter - Deleting 0 documents
>> 2015-04-07 20:11:02,512 INFO solr.SolrIndexWriter - Indexing 250 documents
>> *2015-04-07 20:11:02,576 WARN mapred.LocalJobRunner -
>> job_local1831338118_0001*
>> *org.apache.solr.common.SolrException: HTTP method POST is not supported
>> by this URL*
>> *HTTP method POST is not supported by this URL*
>>
>> request: http://localhost:8983/solr/
>> at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.
>> request(CommonsHttpSolrServer.java:430)
>> at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.
>> request(CommonsHttpSolrServer.java:244)
>> at org.apache.solr.client.solrj.request.AbstractUpdateRequest.
>> process(AbstractUpdateRequest.java:105)
>> at org.apache.nutch.indexwriter.solr.SolrIndexWriter.write(
>> SolrIndexWriter.java:135)
>> at org.apache.nutch.indexer.IndexWriters.write(IndexWriters.java:88)
>> at org.apache.nutch.indexer.IndexerOutputFormat$1.write(
>> IndexerOutputFormat.java:50)
>> at org.apache.nutch.indexer.IndexerOutputFormat$1.write(
>> IndexerOutputFormat.java:41)
>> at org.apache.hadoop.mapred.ReduceTask$OldTrackingRecordWriter.write(
>> ReduceTask.java:458)
>> at org.apache.hadoop.mapred.ReduceTask$3.collect(ReduceTask.java:500)
>> at org.apache.nutch.indexer.IndexerMapReduce.reduce(
>> IndexerMapReduce.java:323)
>> at org.apache.nutch.indexer.IndexerMapReduce.reduce(
>> IndexerMapReduce.java:53)
>> at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:522)
>> at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:421)
>> at org.apache.hadoop.mapred.LocalJobRunner$Job.run(
>> LocalJobRunner.java:398)
>> 2015-04-07 20:11:02,724 ERROR indexer.IndexingJob - Indexer:
>> java.io.IOException: Job failed!
>> at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1357)
>> at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:114)
>> at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:176)
>> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>> at org.apache.nutch.indexer.IndexingJob.main(IndexingJob.java:186)
>>
>>
>> On Tuesday 07 April 2015 06:53 PM, Jeff Cocking wrote:
>>
>>> The command you are using is not pointing to the specific solr index you
>>> created. The http://localhost:8983/solr needs to be changed to the URL
>>> for
>>> the core created. It should look like
>>> http://localhost:8983/solr/#/new_core_name.
>>>
>>>
>>> On Tue, Apr 7, 2015 at 2:33 AM, Anchit Jain <an...@gmail.com>
>>> wrote:
>>>
>>> I followed instructions as given on your blog and created a new core for
>>>> nutch data and copied schema.xml of nutch into it.
>>>> Then I run the following command in nutch working directory
>>>> bin/nutch solrindex http://localhost:8983/solr crawl/crawldb/ -linkdb
>>>> crawl/linkdb/ crawl/segments/20150406231502/ -filter -normalize
>>>>
>>>> But then also the same error is coming as like previous runs.
>>>>
>>>> On Tue, 7 Apr 2015 at 10:00 Jeff Cocking <je...@gmail.com> wrote:
>>>>
>>>> Solr5 is multicore by default. You have not finished the install by
>>>>> setting up solr5's core. I would suggest you look at the link I sent to
>>>>> finish up your setup.
>>>>>
>>>>> After you finish your install your solr URL will be
>>>>> http://localhost:8983/solr/#/core_name.
>>>>>
>>>>> Jeff Cocking
>>>>>
>>>>> I apologize for my brevity.
>>>>> This was sent from my mobile device while I should be focusing on
>>>>> something else.....
>>>>> Like a meeting, driving, family, etc.
>>>>>
>>>>> On Apr 6, 2015, at 11:16 PM, Anchit Jain <an...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> I have already installed Solr.I want to integrate it with nutch.
>>>>>> Whenever I try to issue this command to nutch
>>>>>> ""bin/nutch solrindex http://localhost:8983/solr crawl/crawldb/
>>>>>>
>>>>> -linkdb
>>>>> crawl/linkdb/ crawl/segments/20150406231502/ -filter -normalize"
>>>>>> I always get a error
>>>>>>
>>>>>> Indexer: java.io.IOException: Job failed!
>>>>>> at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1357)
>>>>>> at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:114)
>>>>>> at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:176)
>>>>>> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>>>>>> at org.apache.nutch.indexer.IndexingJob.main(IndexingJob.java:186)
>>>>>>
>>>>>>
>>>>>>
>>>>>> Here is the complete hadoop log for the process.I have underlined the
>>>>>>
>>>>> error
>>>>>
>>>>>> part in it.
>>>>>>
>>>>>> 2015-04-07 09:38:06,613 INFO indexer.IndexingJob - Indexer: starting
>>>>>>
>>>>> at
>>>>> 2015-04-07 09:38:06
>>>>>> 2015-04-07 09:38:06,684 INFO indexer.IndexingJob - Indexer: deleting
>>>>>>
>>>>> gone
>>>>>
>>>>>> documents: false
>>>>>> 2015-04-07 09:38:06,685 INFO indexer.IndexingJob - Indexer: URL
>>>>>>
>>>>> filtering:
>>>>>
>>>>>> true
>>>>>> 2015-04-07 09:38:06,685 INFO indexer.IndexingJob - Indexer: URL
>>>>>> normalizing: true
>>>>>> 2015-04-07 09:38:06,893 INFO indexer.IndexWriters - Adding
>>>>>> org.apache.nutch.indexwriter.solr.SolrIndexWriter
>>>>>> 2015-04-07 09:38:06,893 INFO indexer.IndexingJob - Active
>>>>>>
>>>>> IndexWriters :
>>>>> SOLRIndexWriter
>>>>>> solr.server.url : URL of the SOLR instance (mandatory)
>>>>>> solr.commit.size : buffer size when sending to SOLR (default 1000)
>>>>>> solr.mapping.file : name of the mapping file for fields (default
>>>>>> solrindex-mapping.xml)
>>>>>> solr.auth : use authentication (default false)
>>>>>> solr.auth.username : use authentication (default false)
>>>>>> solr.auth : username for authentication
>>>>>> solr.auth.password : password for authentication
>>>>>>
>>>>>>
>>>>>> 2015-04-07 09:38:06,898 INFO indexer.IndexerMapReduce -
>>>>>>
>>>>> IndexerMapReduce:
>>>>>
>>>>>> crawldb: crawl/crawldb
>>>>>> 2015-04-07 09:38:06,898 INFO indexer.IndexerMapReduce -
>>>>>>
>>>>> IndexerMapReduce:
>>>>>
>>>>>> linkdb: crawl/linkdb
>>>>>> 2015-04-07 09:38:06,898 INFO indexer.IndexerMapReduce -
>>>>>>
>>>>> IndexerMapReduces:
>>>>>
>>>>>> adding segment: crawl/segments/20150406231502
>>>>>> 2015-04-07 09:38:07,036 WARN util.NativeCodeLoader - Unable to load
>>>>>> native-hadoop library for your platform... using builtin-java classes
>>>>>>
>>>>> where
>>>>>
>>>>>> applicable
>>>>>> 2015-04-07 09:38:07,540 INFO anchor.AnchorIndexingFilter - Anchor
>>>>>> deduplication is: off
>>>>>> 2015-04-07 09:38:07,565 INFO regex.RegexURLNormalizer - can't find
>>>>>>
>>>>> rules
>>>>> for scope 'indexer', using default
>>>>>> 2015-04-07 09:38:09,552 INFO regex.RegexURLNormalizer - can't find
>>>>>>
>>>>> rules
>>>>> for scope 'indexer', using default
>>>>>> 2015-04-07 09:38:10,642 INFO regex.RegexURLNormalizer - can't find
>>>>>>
>>>>> rules
>>>>> for scope 'indexer', using default
>>>>>> 2015-04-07 09:38:10,734 INFO regex.RegexURLNormalizer - can't find
>>>>>>
>>>>> rules
>>>>> for scope 'indexer', using default
>>>>>> 2015-04-07 09:38:10,895 INFO regex.RegexURLNormalizer - can't find
>>>>>>
>>>>> rules
>>>>> for scope 'indexer', using default
>>>>>> 2015-04-07 09:38:11,088 INFO regex.RegexURLNormalizer - can't find
>>>>>>
>>>>> rules
>>>>> for scope 'indexer', using default
>>>>>> 2015-04-07 09:38:11,219 INFO indexer.IndexWriters - Adding
>>>>>> org.apache.nutch.indexwriter.solr.SolrIndexWriter
>>>>>> 2015-04-07 09:38:11,237 INFO solr.SolrMappingReader - source: content
>>>>>> dest: content
>>>>>> 2015-04-07 09:38:11,237 INFO solr.SolrMappingReader - source: title
>>>>>>
>>>>> dest:
>>>>>
>>>>>> title
>>>>>> 2015-04-07 09:38:11,237 INFO solr.SolrMappingReader - source: host
>>>>>>
>>>>> dest:
>>>>> host
>>>>>> 2015-04-07 09:38:11,237 INFO solr.SolrMappingReader - source: segment
>>>>>> dest: segment
>>>>>> 2015-04-07 09:38:11,237 INFO solr.SolrMappingReader - source: boost
>>>>>>
>>>>> dest:
>>>>>
>>>>>> boost
>>>>>> 2015-04-07 09:38:11,237 INFO solr.SolrMappingReader - source: digest
>>>>>>
>>>>> dest:
>>>>>
>>>>>> digest
>>>>>> 2015-04-07 09:38:11,237 INFO solr.SolrMappingReader - source: tstamp
>>>>>>
>>>>> dest:
>>>>>
>>>>>> tstamp
>>>>>> 2015-04-07 09:38:11,526 INFO solr.SolrIndexWriter - Indexing 250
>>>>>>
>>>>> documents
>>>>>
>>>>>> 2015-04-07 09:38:11,526 INFO solr.SolrIndexWriter - Deleting 0
>>>>>>
>>>>> documents
>>>>> 2015-04-07 09:38:11,644 INFO solr.SolrIndexWriter - Indexing 250
>>>>> documents
>>>>>
>>>>>> *2015-04-07 09:38:11,699 WARN mapred.LocalJobRunner -
>>>>>> job_local1245074757_0001*
>>>>>> *org.apache.solr.common.SolrException: Not Found*
>>>>>>
>>>>>> *Not Found*
>>>>>>
>>>>>> *request: http://localhost:8983/solr/update?wt=javabin&version=2
>>>>>> <http://localhost:8983/solr/update?wt=javabin&version=2>*
>>>>>> * at
>>>>>> org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.
>>>>>>
>>>>> request(CommonsHttpSolrServer.java:430)*
>>>>>
>>>>>> * at
>>>>>> org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.
>>>>>>
>>>>> request(CommonsHttpSolrServer.java:244)*
>>>>>
>>>>>> * at
>>>>>> org.apache.solr.client.solrj.request.AbstractUpdateRequest.
>>>>>>
>>>>> process(AbstractUpdateRequest.java:105)*
>>>>>
>>>>>> * at
>>>>>> org.apache.nutch.indexwriter.solr.SolrIndexWriter.write(
>>>>>>
>>>>> SolrIndexWriter.java:135)*
>>>>>
>>>>>> * at org.apache.nutch.indexer.IndexWriters.write(
>>>>>> IndexWriters.java:88)*
>>>>>> * at
>>>>>> org.apache.nutch.indexer.IndexerOutputFormat$1.write(
>>>>>>
>>>>> IndexerOutputFormat.java:50)*
>>>>>
>>>>>> * at
>>>>>> org.apache.nutch.indexer.IndexerOutputFormat$1.write(
>>>>>>
>>>>> IndexerOutputFormat.java:41)*
>>>>>
>>>>>> * at
>>>>>> org.apache.hadoop.mapred.ReduceTask$OldTrackingRecordWriter.write(
>>>>>>
>>>>> ReduceTask.java:458)*
>>>>>
>>>>>> * at
>>>>>>
>>>>> org.apache.hadoop.mapred.ReduceTask$3.collect(ReduceTask.java:500)*
>>>>> * at
>>>>>> org.apache.nutch.indexer.IndexerMapReduce.reduce(
>>>>>>
>>>>> IndexerMapReduce.java:323)*
>>>>>
>>>>>> * at
>>>>>> org.apache.nutch.indexer.IndexerMapReduce.reduce(
>>>>>>
>>>>> IndexerMapReduce.java:53)*
>>>>>
>>>>>> * at org.apache.hadoop.mapred.ReduceTask.runOldReducer(
>>>>>>
>>>>> ReduceTask.java:522)*
>>>>>
>>>>>> * at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:421)*
>>>>>> * at
>>>>>> org.apache.hadoop.mapred.LocalJobRunner$Job.run(
>>>>>>
>>>>> LocalJobRunner.java:398)*
>>>>>
>>>>>> *2015-04-07 09:38:12,408 ERROR indexer.IndexingJob - Indexer:
>>>>>> java.io.IOException: Job failed!*
>>>>>> * at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1357)*
>>>>>> * at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:114)*
>>>>>> * at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:176)*
>>>>>> * at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)*
>>>>>> * at org.apache.nutch.indexer.IndexingJob.main(IndexingJob.java:186)*
>>>>>>
>>>>>>> On Tue, 7 Apr 2015 at 03:18 Jeff Cocking <je...@gmail.com>
>>>>>>>
>>>>>> wrote:
>>>>>> With Solr5.0.0 you can skip that step. Solr will auto create your
>>>>>> schema
>>>>>> document based on the data being provided.
>>>>>>> One of the new features with Solr5 is the install/service feature. I
>>>>>>>
>>>>>> did a
>>>>>> quick write up on how to install Solr5 on Centos. Might be something
>>>>>>> useful there for you.
>>>>>>>
>>>>>>> http://www.cocking.com/apache-solr-5-0-install-on-centos-7/
>>>>>>>
>>>>>>> jeff
>>>>>>>
>>>>>>> On Mon, Apr 6, 2015 at 3:13 PM, Anchit Jain <anchitjain1234@gmail.com
>>>>>>> wrote:
>>>>>>>
>>>>>>> I want to index nutch results using *Solr 5.0* but as mentioned in
>>>>>>>> https://wiki.apache.org/nutch/NutchTutorial there is no directory
>>>>>>>> ${APACHE_SOLR_HOME}/example/solr/collection1/conf/
>>>>>>>> in solr 5.0 . So where I have to copy *schema.xml*?
>>>>>>>> Also there is no *start.jar* present in example directory.
>>>>>>>>
Re: Nutch 1.9 integration with Solr 5.0.0
Posted by Jeff Cocking <je...@gmail.com>.
There can be numerous reasons: hosts.conf, firewall, etc. These are all
unique to your system.
Have you viewed the Solr admin panel via a browser? This is a critical
step in the installation; it validates that Solr can accept HTTP commands.
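One more thing worth checking (a hedged sketch, not something from the Nutch docs): the `/#/core_name` form shown in the admin panel is a browser-side route. The `#` fragment is never sent over HTTP, so a client given that URL effectively POSTs to `/solr/` and gets "HTTP method POST is not supported by this URL". A small Python 3 helper to derive the fragment-free REST base a client actually needs:

```python
# Sketch: convert an admin-UI URL like http://host:8983/solr/#/foo into
# the REST base URL http://host:8983/solr/foo. The "#/foo" part is a
# browser fragment and is never transmitted in an HTTP request.
from urllib.parse import urlsplit

def rest_base(admin_url: str) -> str:
    parts = urlsplit(admin_url)           # fragment is parsed out separately
    core = parts.fragment.strip("/")      # "#/foo" -> "foo"; "" if absent
    base = parts.scheme + "://" + parts.netloc + parts.path.rstrip("/")
    return base + "/" + core if core else base

print(rest_base("http://localhost:8983/solr/#/foo"))
# -> http://localhost:8983/solr/foo
```

Passing that fragment-free, per-core URL to solrindex is what a client library can actually resolve.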
On Tue, Apr 7, 2015 at 9:53 AM, Anchit Jain <an...@gmail.com>
wrote:
> I created a new core named *foo*. Then I copied the *schema.xml* from
> *nutch* into *var/solr/data/foo/conf* with changes as described in
> *https://wiki.apache.org/nutch/NutchTutorial*.
> I changed the URL to *http://localhost:8983/solr/#/foo*,
> so the new command is
> "*bin/nutch solrindex http://localhost:8983/solr/#/foo crawl/crawldb/
> -linkdb crawl/linkdb/ crawl/segments/20150406231502/ -filter -normalize*"
> But now I am getting the error
> *org.apache.solr.common.SolrException: HTTP method POST is not supported
> by this URL*
> Is some other change also required in the URL to support POST requests?
>
> Full log
>
> 2015-04-07 20:10:56,068 INFO indexer.IndexingJob - Indexer: starting at
> 2015-04-07 20:10:56
> 2015-04-07 20:10:56,178 INFO indexer.IndexingJob - Indexer: deleting gone
> documents: false
> 2015-04-07 20:10:56,178 INFO indexer.IndexingJob - Indexer: URL
> filtering: true
> 2015-04-07 20:10:56,178 INFO indexer.IndexingJob - Indexer: URL
> normalizing: true
> 2015-04-07 20:10:56,727 INFO indexer.IndexWriters - Adding
> org.apache.nutch.indexwriter.solr.SolrIndexWriter
> 2015-04-07 20:10:56,727 INFO indexer.IndexingJob - Active IndexWriters :
> SOLRIndexWriter
> solr.server.url : URL of the SOLR instance (mandatory)
> solr.commit.size : buffer size when sending to SOLR (default 1000)
> solr.mapping.file : name of the mapping file for fields (default
> solrindex-mapping.xml)
> solr.auth : use authentication (default false)
> solr.auth.username : use authentication (default false)
> solr.auth : username for authentication
> solr.auth.password : password for authentication
>
>
> 2015-04-07 20:10:56,772 INFO indexer.IndexerMapReduce - IndexerMapReduce:
> crawldb: crawl/crawldb
> 2015-04-07 20:10:56,772 INFO indexer.IndexerMapReduce - IndexerMapReduce:
> linkdb: crawl/linkdb
> 2015-04-07 20:10:56,772 INFO indexer.IndexerMapReduce -
> IndexerMapReduces: adding segment: crawl/segments/20150406231502
> 2015-04-07 20:10:57,205 WARN util.NativeCodeLoader - Unable to load
> native-hadoop library for your platform... using builtin-java classes where
> applicable
> 2015-04-07 20:10:58,020 INFO anchor.AnchorIndexingFilter - Anchor
> deduplication is: off
> 2015-04-07 20:10:58,134 INFO regex.RegexURLNormalizer - can't find rules
> for scope 'indexer', using default
> 2015-04-07 20:11:00,114 INFO regex.RegexURLNormalizer - can't find rules
> for scope 'indexer', using default
> 2015-04-07 20:11:01,205 INFO regex.RegexURLNormalizer - can't find rules
> for scope 'indexer', using default
> 2015-04-07 20:11:01,344 INFO regex.RegexURLNormalizer - can't find rules
> for scope 'indexer', using default
> 2015-04-07 20:11:01,577 INFO regex.RegexURLNormalizer - can't find rules
> for scope 'indexer', using default
> 2015-04-07 20:11:01,788 INFO regex.RegexURLNormalizer - can't find rules
> for scope 'indexer', using default
> 2015-04-07 20:11:01,921 INFO indexer.IndexWriters - Adding
> org.apache.nutch.indexwriter.solr.SolrIndexWriter
> 2015-04-07 20:11:01,986 INFO solr.SolrMappingReader - source: content
> dest: content
> 2015-04-07 20:11:01,986 INFO solr.SolrMappingReader - source: title dest:
> title
> 2015-04-07 20:11:01,986 INFO solr.SolrMappingReader - source: host dest:
> host
> 2015-04-07 20:11:01,986 INFO solr.SolrMappingReader - source: segment
> dest: segment
> 2015-04-07 20:11:01,987 INFO solr.SolrMappingReader - source: boost dest:
> boost
> 2015-04-07 20:11:01,987 INFO solr.SolrMappingReader - source: digest
> dest: digest
> 2015-04-07 20:11:01,987 INFO solr.SolrMappingReader - source: tstamp
> dest: tstamp
> 2015-04-07 20:11:02,266 INFO solr.SolrIndexWriter - Indexing 250 documents
> 2015-04-07 20:11:02,267 INFO solr.SolrIndexWriter - Deleting 0 documents
> 2015-04-07 20:11:02,512 INFO solr.SolrIndexWriter - Indexing 250 documents
> *2015-04-07 20:11:02,576 WARN mapred.LocalJobRunner -
> job_local1831338118_0001*
> *org.apache.solr.common.SolrException: HTTP method POST is not supported
> by this URL*
> *HTTP method POST is not supported by this URL*
>
> request: http://localhost:8983/solr/
> at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.
> request(CommonsHttpSolrServer.java:430)
> at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.
> request(CommonsHttpSolrServer.java:244)
> at org.apache.solr.client.solrj.request.AbstractUpdateRequest.
> process(AbstractUpdateRequest.java:105)
> at org.apache.nutch.indexwriter.solr.SolrIndexWriter.write(
> SolrIndexWriter.java:135)
> at org.apache.nutch.indexer.IndexWriters.write(IndexWriters.java:88)
> at org.apache.nutch.indexer.IndexerOutputFormat$1.write(
> IndexerOutputFormat.java:50)
> at org.apache.nutch.indexer.IndexerOutputFormat$1.write(
> IndexerOutputFormat.java:41)
> at org.apache.hadoop.mapred.ReduceTask$OldTrackingRecordWriter.write(
> ReduceTask.java:458)
> at org.apache.hadoop.mapred.ReduceTask$3.collect(ReduceTask.java:500)
> at org.apache.nutch.indexer.IndexerMapReduce.reduce(
> IndexerMapReduce.java:323)
> at org.apache.nutch.indexer.IndexerMapReduce.reduce(
> IndexerMapReduce.java:53)
> at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:522)
> at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:421)
> at org.apache.hadoop.mapred.LocalJobRunner$Job.run(
> LocalJobRunner.java:398)
> 2015-04-07 20:11:02,724 ERROR indexer.IndexingJob - Indexer:
> java.io.IOException: Job failed!
> at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1357)
> at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:114)
> at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:176)
> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
> at org.apache.nutch.indexer.IndexingJob.main(IndexingJob.java:186)
>
>
> On Tuesday 07 April 2015 06:53 PM, Jeff Cocking wrote:
>
>> The command you are using is not pointing to the specific solr index you
>> created. The http://localhost:8983/solr needs to be changed to the URL
>> for
>> the core created. It should look like
>> http://localhost:8983/solr/#/new_core_name.
>>
>>
>> On Tue, Apr 7, 2015 at 2:33 AM, Anchit Jain <an...@gmail.com>
>> wrote:
>>
>> I followed instructions as given on your blog and created a new core for
>>> nutch data and copied schema.xml of nutch into it.
>>> Then I run the following command in nutch working directory
>>> bin/nutch solrindex http://localhost:8983/solr crawl/crawldb/ -linkdb
>>> crawl/linkdb/ crawl/segments/20150406231502/ -filter -normalize
>>>
>>> But then also the same error is coming as like previous runs.
>>>
>>> On Tue, 7 Apr 2015 at 10:00 Jeff Cocking <je...@gmail.com> wrote:
>>>
>>> Solr5 is multicore by default. You have not finished the install by
>>>> setting up solr5's core. I would suggest you look at the link I sent to
>>>> finish up your setup.
>>>>
>>>> After you finish your install your solr URL will be
>>>> http://localhost:8983/solr/#/core_name.
>>>>
>>>> Jeff Cocking
>>>>
>>>> I apologize for my brevity.
>>>> This was sent from my mobile device while I should be focusing on
>>>> something else.....
>>>> Like a meeting, driving, family, etc.
>>>>
>>>> On Apr 6, 2015, at 11:16 PM, Anchit Jain <an...@gmail.com>
>>>>>
>>>> wrote:
>>>>
>>>>> I have already installed Solr.I want to integrate it with nutch.
>>>>> Whenever I try to issue this command to nutch
>>>>> ""bin/nutch solrindex http://localhost:8983/solr crawl/crawldb/
>>>>>
>>>> -linkdb
>>>
>>>> crawl/linkdb/ crawl/segments/20150406231502/ -filter -normalize"
>>>>>
>>>>> I always get a error
>>>>>
>>>>> Indexer: java.io.IOException: Job failed!
>>>>> at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1357)
>>>>> at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:114)
>>>>> at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:176)
>>>>> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>>>>> at org.apache.nutch.indexer.IndexingJob.main(IndexingJob.java:186)
>>>>>
>>>>>
>>>>>
>>>>> Here is the complete hadoop log for the process.I have underlined the
>>>>>
>>>> error
>>>>
>>>>> part in it.
>>>>>
>>>>> 2015-04-07 09:38:06,613 INFO indexer.IndexingJob - Indexer: starting
>>>>>
>>>> at
>>>
>>>> 2015-04-07 09:38:06
>>>>> 2015-04-07 09:38:06,684 INFO indexer.IndexingJob - Indexer: deleting
>>>>>
>>>> gone
>>>>
>>>>> documents: false
>>>>> 2015-04-07 09:38:06,685 INFO indexer.IndexingJob - Indexer: URL
>>>>>
>>>> filtering:
>>>>
>>>>> true
>>>>> 2015-04-07 09:38:06,685 INFO indexer.IndexingJob - Indexer: URL
>>>>> normalizing: true
>>>>> 2015-04-07 09:38:06,893 INFO indexer.IndexWriters - Adding
>>>>> org.apache.nutch.indexwriter.solr.SolrIndexWriter
>>>>> 2015-04-07 09:38:06,893 INFO indexer.IndexingJob - Active
>>>>>
>>>> IndexWriters :
>>>
>>>> SOLRIndexWriter
>>>>> solr.server.url : URL of the SOLR instance (mandatory)
>>>>> solr.commit.size : buffer size when sending to SOLR (default 1000)
>>>>> solr.mapping.file : name of the mapping file for fields (default
>>>>> solrindex-mapping.xml)
>>>>> solr.auth : use authentication (default false)
>>>>> solr.auth.username : use authentication (default false)
>>>>> solr.auth : username for authentication
>>>>> solr.auth.password : password for authentication
>>>>>
>>>>>
>>>>> 2015-04-07 09:38:06,898 INFO indexer.IndexerMapReduce -
>>>>>
>>>> IndexerMapReduce:
>>>>
>>>>> crawldb: crawl/crawldb
>>>>> 2015-04-07 09:38:06,898 INFO indexer.IndexerMapReduce -
>>>>>
>>>> IndexerMapReduce:
>>>>
>>>>> linkdb: crawl/linkdb
>>>>> 2015-04-07 09:38:06,898 INFO indexer.IndexerMapReduce -
>>>>>
>>>> IndexerMapReduces:
>>>>
>>>>> adding segment: crawl/segments/20150406231502
>>>>> 2015-04-07 09:38:07,036 WARN util.NativeCodeLoader - Unable to load
>>>>> native-hadoop library for your platform... using builtin-java classes
>>>>>
>>>> where
>>>>
>>>>> applicable
>>>>> 2015-04-07 09:38:07,540 INFO anchor.AnchorIndexingFilter - Anchor
>>>>> deduplication is: off
>>>>> 2015-04-07 09:38:07,565 INFO regex.RegexURLNormalizer - can't find
>>>>>
>>>> rules
>>>
>>>> for scope 'indexer', using default
>>>>> 2015-04-07 09:38:09,552 INFO regex.RegexURLNormalizer - can't find
>>>>>
>>>> rules
>>>
>>>> for scope 'indexer', using default
>>>>> 2015-04-07 09:38:10,642 INFO regex.RegexURLNormalizer - can't find
>>>>>
>>>> rules
>>>
>>>> for scope 'indexer', using default
>>>>> 2015-04-07 09:38:10,734 INFO regex.RegexURLNormalizer - can't find
>>>>>
>>>> rules
>>>
>>>> for scope 'indexer', using default
>>>>> 2015-04-07 09:38:10,895 INFO regex.RegexURLNormalizer - can't find
>>>>>
>>>> rules
>>>
>>>> for scope 'indexer', using default
>>>>> 2015-04-07 09:38:11,088 INFO regex.RegexURLNormalizer - can't find
>>>>>
>>>> rules
>>>
>>>> for scope 'indexer', using default
>>>>> 2015-04-07 09:38:11,219 INFO indexer.IndexWriters - Adding
>>>>> org.apache.nutch.indexwriter.solr.SolrIndexWriter
>>>>> 2015-04-07 09:38:11,237 INFO solr.SolrMappingReader - source: content
>>>>> dest: content
>>>>> 2015-04-07 09:38:11,237 INFO solr.SolrMappingReader - source: title
>>>>>
>>>> dest:
>>>>
>>>>> title
>>>>> 2015-04-07 09:38:11,237 INFO solr.SolrMappingReader - source: host
>>>>>
>>>> dest:
>>>
>>>> host
>>>>> 2015-04-07 09:38:11,237 INFO solr.SolrMappingReader - source: segment
>>>>> dest: segment
>>>>> 2015-04-07 09:38:11,237 INFO solr.SolrMappingReader - source: boost
>>>>>
>>>> dest:
>>>>
>>>>> boost
>>>>> 2015-04-07 09:38:11,237 INFO solr.SolrMappingReader - source: digest
>>>>>
>>>> dest:
>>>>
>>>>> digest
>>>>> 2015-04-07 09:38:11,237 INFO solr.SolrMappingReader - source: tstamp
>>>>>
>>>> dest:
>>>>
>>>>> tstamp
>>>>> 2015-04-07 09:38:11,526 INFO solr.SolrIndexWriter - Indexing 250
>>>>>
>>>> documents
>>>>
>>>>> 2015-04-07 09:38:11,526 INFO solr.SolrIndexWriter - Deleting 0
>>>>>
>>>> documents
>>>
>>>> 2015-04-07 09:38:11,644 INFO solr.SolrIndexWriter - Indexing 250
>>>>>
>>>> documents
>>>>
>>>>> *2015-04-07 09:38:11,699 WARN mapred.LocalJobRunner -
>>>>> job_local1245074757_0001*
>>>>> *org.apache.solr.common.SolrException: Not Found*
>>>>>
>>>>> *Not Found*
>>>>>
>>>>> *request: http://localhost:8983/solr/update?wt=javabin&version=2
>>>>> <http://localhost:8983/solr/update?wt=javabin&version=2>*
>>>>> * at
>>>>> org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.
>>>>>
>>>> request(CommonsHttpSolrServer.java:430)*
>>>>
>>>>> * at
>>>>> org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.
>>>>>
>>>> request(CommonsHttpSolrServer.java:244)*
>>>>
>>>>> * at
>>>>> org.apache.solr.client.solrj.request.AbstractUpdateRequest.
>>>>>
>>>> process(AbstractUpdateRequest.java:105)*
>>>>
>>>>> * at
>>>>> org.apache.nutch.indexwriter.solr.SolrIndexWriter.write(
>>>>>
>>>> SolrIndexWriter.java:135)*
>>>>
>>>>> * at org.apache.nutch.indexer.IndexWriters.write(
>>>>> IndexWriters.java:88)*
>>>>> * at
>>>>> org.apache.nutch.indexer.IndexerOutputFormat$1.write(
>>>>>
>>>> IndexerOutputFormat.java:50)*
>>>>
>>>>> * at
>>>>> org.apache.nutch.indexer.IndexerOutputFormat$1.write(
>>>>>
>>>> IndexerOutputFormat.java:41)*
>>>>
>>>>> * at
>>>>> org.apache.hadoop.mapred.ReduceTask$OldTrackingRecordWriter.write(
>>>>>
>>>> ReduceTask.java:458)*
>>>>
>>>>> * at
>>>>>
>>>> org.apache.hadoop.mapred.ReduceTask$3.collect(ReduceTask.java:500)*
>>>
>>>> * at
>>>>> org.apache.nutch.indexer.IndexerMapReduce.reduce(
>>>>>
>>>> IndexerMapReduce.java:323)*
>>>>
>>>>> * at
>>>>> org.apache.nutch.indexer.IndexerMapReduce.reduce(
>>>>>
>>>> IndexerMapReduce.java:53)*
>>>>
>>>>> * at org.apache.hadoop.mapred.ReduceTask.runOldReducer(
>>>>>
>>>> ReduceTask.java:522)*
>>>>
>>>>> * at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:421)*
>>>>> * at
>>>>> org.apache.hadoop.mapred.LocalJobRunner$Job.run(
>>>>>
>>>> LocalJobRunner.java:398)*
>>>>
>>>>> *2015-04-07 09:38:12,408 ERROR indexer.IndexingJob - Indexer:
>>>>> java.io.IOException: Job failed!*
>>>>> * at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1357)*
>>>>> * at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:114)*
>>>>> * at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:176)*
>>>>> * at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)*
>>>>> * at org.apache.nutch.indexer.IndexingJob.main(IndexingJob.java:186)*
>>>>>
>>>>>> On Tue, 7 Apr 2015 at 03:18 Jeff Cocking <je...@gmail.com>
>>>>>>
>>>>> wrote:
>>>>
>>>>> With Solr5.0.0 you can skip that step. Solr will auto create your
>>>>>>
>>>>> schema
>>>>
>>>>> document based on the data being provided.
>>>>>>
>>>>>> One of the new features with Solr5 is the install/service feature. I
>>>>>>
>>>>> did a
>>>>
>>>>> quick write up on how to install Solr5 on Centos. Might be something
>>>>>> useful there for you.
>>>>>>
>>>>>> http://www.cocking.com/apache-solr-5-0-install-on-centos-7/
>>>>>>
>>>>>> jeff
>>>>>>
>>>>>> On Mon, Apr 6, 2015 at 3:13 PM, Anchit Jain <anchitjain1234@gmail.com
>>>>>> wrote:
>>>>>>
>>>>>> I want to index nutch results using *Solr 5.0* but as mentioned in
>>>>>>> https://wiki.apache.org/nutch/NutchTutorial there is no directory
>>>>>>> ${APACHE_SOLR_HOME}/example/solr/collection1/conf/
>>>>>>> in solr 5.0 . So where I have to copy *schema.xml*?
>>>>>>> Also there is no *start.jar* present in example directory.
>>>>>>>
>>>>>>
>
Re: Nutch 1.9 integration with Solr 5.0.0
Posted by Anchit Jain <an...@gmail.com>.
I created a new core named *foo*. Then I copied the *schema.xml* from
*nutch* into *var/solr/data/foo/conf*, with the changes described in
*https://wiki.apache.org/nutch/NutchTutorial*.
I changed the URL to *http://localhost:8983/solr/#/foo*,
so the new command is
"*bin/nutch solrindex http://localhost:8983/solr/#/foo crawl/crawldb/
-linkdb crawl/linkdb/ crawl/segments/20150406231502/ -filter -normalize*"
But now I am getting the error
*org.apache.solr.common.SolrException: HTTP method POST is not supported
by this URL*
Is some other change also required in the URL to support POST requests?
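[Editorial note: the `#/foo` part of that URL is very likely the problem. It is the browser-side route of the Solr admin console, and a URL fragment is never transmitted to the server, so the indexer ends up POSTing to `/solr/` itself, which only serves the UI. A minimal sketch, assuming the core is named `foo` as in the post above, of rewriting the admin-UI URL into the endpoint the indexer should be given:]

```shell
# The '#/foo' fragment is interpreted only by the browser; strip it to get
# the real core path, which does accept POSTed updates.
ui_url="http://localhost:8983/solr/#/foo"
core_url=$(printf '%s' "$ui_url" | sed 's|/#/|/|')
echo "$core_url"
# http://localhost:8983/solr/foo
```

With that rewritten URL, the invocation would be
`bin/nutch solrindex http://localhost:8983/solr/foo crawl/crawldb/ -linkdb crawl/linkdb/ crawl/segments/20150406231502/ -filter -normalize`
(again assuming the core is named foo).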
Full log
2015-04-07 20:10:56,068 INFO indexer.IndexingJob - Indexer: starting at
2015-04-07 20:10:56
2015-04-07 20:10:56,178 INFO indexer.IndexingJob - Indexer: deleting
gone documents: false
2015-04-07 20:10:56,178 INFO indexer.IndexingJob - Indexer: URL
filtering: true
2015-04-07 20:10:56,178 INFO indexer.IndexingJob - Indexer: URL
normalizing: true
2015-04-07 20:10:56,727 INFO indexer.IndexWriters - Adding
org.apache.nutch.indexwriter.solr.SolrIndexWriter
2015-04-07 20:10:56,727 INFO indexer.IndexingJob - Active IndexWriters :
SOLRIndexWriter
solr.server.url : URL of the SOLR instance (mandatory)
solr.commit.size : buffer size when sending to SOLR (default 1000)
solr.mapping.file : name of the mapping file for fields (default
solrindex-mapping.xml)
solr.auth : use authentication (default false)
solr.auth.username : use authentication (default false)
solr.auth : username for authentication
solr.auth.password : password for authentication
2015-04-07 20:10:56,772 INFO indexer.IndexerMapReduce -
IndexerMapReduce: crawldb: crawl/crawldb
2015-04-07 20:10:56,772 INFO indexer.IndexerMapReduce -
IndexerMapReduce: linkdb: crawl/linkdb
2015-04-07 20:10:56,772 INFO indexer.IndexerMapReduce -
IndexerMapReduces: adding segment: crawl/segments/20150406231502
2015-04-07 20:10:57,205 WARN util.NativeCodeLoader - Unable to load
native-hadoop library for your platform... using builtin-java classes
where applicable
2015-04-07 20:10:58,020 INFO anchor.AnchorIndexingFilter - Anchor
deduplication is: off
2015-04-07 20:10:58,134 INFO regex.RegexURLNormalizer - can't find
rules for scope 'indexer', using default
2015-04-07 20:11:00,114 INFO regex.RegexURLNormalizer - can't find
rules for scope 'indexer', using default
2015-04-07 20:11:01,205 INFO regex.RegexURLNormalizer - can't find
rules for scope 'indexer', using default
2015-04-07 20:11:01,344 INFO regex.RegexURLNormalizer - can't find
rules for scope 'indexer', using default
2015-04-07 20:11:01,577 INFO regex.RegexURLNormalizer - can't find
rules for scope 'indexer', using default
2015-04-07 20:11:01,788 INFO regex.RegexURLNormalizer - can't find
rules for scope 'indexer', using default
2015-04-07 20:11:01,921 INFO indexer.IndexWriters - Adding
org.apache.nutch.indexwriter.solr.SolrIndexWriter
2015-04-07 20:11:01,986 INFO solr.SolrMappingReader - source: content
dest: content
2015-04-07 20:11:01,986 INFO solr.SolrMappingReader - source: title
dest: title
2015-04-07 20:11:01,986 INFO solr.SolrMappingReader - source: host
dest: host
2015-04-07 20:11:01,986 INFO solr.SolrMappingReader - source: segment
dest: segment
2015-04-07 20:11:01,987 INFO solr.SolrMappingReader - source: boost
dest: boost
2015-04-07 20:11:01,987 INFO solr.SolrMappingReader - source: digest
dest: digest
2015-04-07 20:11:01,987 INFO solr.SolrMappingReader - source: tstamp
dest: tstamp
2015-04-07 20:11:02,266 INFO solr.SolrIndexWriter - Indexing 250 documents
2015-04-07 20:11:02,267 INFO solr.SolrIndexWriter - Deleting 0 documents
2015-04-07 20:11:02,512 INFO solr.SolrIndexWriter - Indexing 250 documents
*2015-04-07 20:11:02,576 WARN mapred.LocalJobRunner -
job_local1831338118_0001*
*org.apache.solr.common.SolrException: HTTP method POST is not supported
by this URL*
*HTTP method POST is not supported by this URL*
request: http://localhost:8983/solr/
at
org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:430)
at
org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:244)
at
org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:105)
at
org.apache.nutch.indexwriter.solr.SolrIndexWriter.write(SolrIndexWriter.java:135)
at org.apache.nutch.indexer.IndexWriters.write(IndexWriters.java:88)
at
org.apache.nutch.indexer.IndexerOutputFormat$1.write(IndexerOutputFormat.java:50)
at
org.apache.nutch.indexer.IndexerOutputFormat$1.write(IndexerOutputFormat.java:41)
at
org.apache.hadoop.mapred.ReduceTask$OldTrackingRecordWriter.write(ReduceTask.java:458)
at org.apache.hadoop.mapred.ReduceTask$3.collect(ReduceTask.java:500)
at
org.apache.nutch.indexer.IndexerMapReduce.reduce(IndexerMapReduce.java:323)
at
org.apache.nutch.indexer.IndexerMapReduce.reduce(IndexerMapReduce.java:53)
at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:522)
at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:421)
at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:398)
2015-04-07 20:11:02,724 ERROR indexer.IndexingJob - Indexer:
java.io.IOException: Job failed!
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1357)
at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:114)
at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:176)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at org.apache.nutch.indexer.IndexingJob.main(IndexingJob.java:186)
On Tuesday 07 April 2015 06:53 PM, Jeff Cocking wrote:
> The command you are using is not pointing to the specific solr index you
> created. The http://localhost:8983/solr needs to be changed to the URL for
> the core created. It should look like
> http://localhost:8983/solr/#/new_core_name.
>
>
> On Tue, Apr 7, 2015 at 2:33 AM, Anchit Jain <an...@gmail.com>
> wrote:
>
>> I followed instructions as given on your blog and created a new core for
>> nutch data and copied schema.xml of nutch into it.
>> Then I run the following command in nutch working directory
>> bin/nutch solrindex http://localhost:8983/solr crawl/crawldb/ -linkdb
>> crawl/linkdb/ crawl/segments/20150406231502/ -filter -normalize
>>
>> But then also the same error is coming as like previous runs.
>>
>> On Tue, 7 Apr 2015 at 10:00 Jeff Cocking <je...@gmail.com> wrote:
>>
>>> Solr5 is multicore by default. You have not finished the install by
>>> setting up solr5's core. I would suggest you look at the link I sent to
>>> finish up your setup.
>>>
>>> After you finish your install your solr URL will be
>>> http://localhost:8983/solr/#/core_name.
>>>
>>> Jeff Cocking
>>>
>>> I apologize for my brevity.
>>> This was sent from my mobile device while I should be focusing on
>>> something else.....
>>> Like a meeting, driving, family, etc.
>>>
>>>> On Apr 6, 2015, at 11:16 PM, Anchit Jain <an...@gmail.com>
>>> wrote:
>>>> I have already installed Solr.I want to integrate it with nutch.
>>>> Whenever I try to issue this command to nutch
>>>> ""bin/nutch solrindex http://localhost:8983/solr crawl/crawldb/
>> -linkdb
>>>> crawl/linkdb/ crawl/segments/20150406231502/ -filter -normalize"
>>>>
>>>> I always get a error
>>>>
>>>> Indexer: java.io.IOException: Job failed!
>>>> at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1357)
>>>> at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:114)
>>>> at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:176)
>>>> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>>>> at org.apache.nutch.indexer.IndexingJob.main(IndexingJob.java:186)
>>>>
>>>>
>>>>
>>>> Here is the complete hadoop log for the process.I have underlined the
>>> error
>>>> part in it.
>>>>
>>>> 2015-04-07 09:38:06,613 INFO indexer.IndexingJob - Indexer: starting
>> at
>>>> 2015-04-07 09:38:06
>>>> 2015-04-07 09:38:06,684 INFO indexer.IndexingJob - Indexer: deleting
>>> gone
>>>> documents: false
>>>> 2015-04-07 09:38:06,685 INFO indexer.IndexingJob - Indexer: URL
>>> filtering:
>>>> true
>>>> 2015-04-07 09:38:06,685 INFO indexer.IndexingJob - Indexer: URL
>>>> normalizing: true
>>>> 2015-04-07 09:38:06,893 INFO indexer.IndexWriters - Adding
>>>> org.apache.nutch.indexwriter.solr.SolrIndexWriter
>>>> 2015-04-07 09:38:06,893 INFO indexer.IndexingJob - Active
>> IndexWriters :
>>>> SOLRIndexWriter
>>>> solr.server.url : URL of the SOLR instance (mandatory)
>>>> solr.commit.size : buffer size when sending to SOLR (default 1000)
>>>> solr.mapping.file : name of the mapping file for fields (default
>>>> solrindex-mapping.xml)
>>>> solr.auth : use authentication (default false)
>>>> solr.auth.username : use authentication (default false)
>>>> solr.auth : username for authentication
>>>> solr.auth.password : password for authentication
>>>>
>>>>
>>>> 2015-04-07 09:38:06,898 INFO indexer.IndexerMapReduce -
>>> IndexerMapReduce:
>>>> crawldb: crawl/crawldb
>>>> 2015-04-07 09:38:06,898 INFO indexer.IndexerMapReduce -
>>> IndexerMapReduce:
>>>> linkdb: crawl/linkdb
>>>> 2015-04-07 09:38:06,898 INFO indexer.IndexerMapReduce -
>>> IndexerMapReduces:
>>>> adding segment: crawl/segments/20150406231502
>>>> 2015-04-07 09:38:07,036 WARN util.NativeCodeLoader - Unable to load
>>>> native-hadoop library for your platform... using builtin-java classes
>>> where
>>>> applicable
>>>> 2015-04-07 09:38:07,540 INFO anchor.AnchorIndexingFilter - Anchor
>>>> deduplication is: off
>>>> 2015-04-07 09:38:07,565 INFO regex.RegexURLNormalizer - can't find
>> rules
>>>> for scope 'indexer', using default
>>>> 2015-04-07 09:38:09,552 INFO regex.RegexURLNormalizer - can't find
>> rules
>>>> for scope 'indexer', using default
>>>> 2015-04-07 09:38:10,642 INFO regex.RegexURLNormalizer - can't find
>> rules
>>>> for scope 'indexer', using default
>>>> 2015-04-07 09:38:10,734 INFO regex.RegexURLNormalizer - can't find
>> rules
>>>> for scope 'indexer', using default
>>>> 2015-04-07 09:38:10,895 INFO regex.RegexURLNormalizer - can't find
>> rules
>>>> for scope 'indexer', using default
>>>> 2015-04-07 09:38:11,088 INFO regex.RegexURLNormalizer - can't find
>> rules
>>>> for scope 'indexer', using default
>>>> 2015-04-07 09:38:11,219 INFO indexer.IndexWriters - Adding
>>>> org.apache.nutch.indexwriter.solr.SolrIndexWriter
>>>> 2015-04-07 09:38:11,237 INFO solr.SolrMappingReader - source: content
>>>> dest: content
>>>> 2015-04-07 09:38:11,237 INFO solr.SolrMappingReader - source: title
>>> dest:
>>>> title
>>>> 2015-04-07 09:38:11,237 INFO solr.SolrMappingReader - source: host
>> dest:
>>>> host
>>>> 2015-04-07 09:38:11,237 INFO solr.SolrMappingReader - source: segment
>>>> dest: segment
>>>> 2015-04-07 09:38:11,237 INFO solr.SolrMappingReader - source: boost
>>> dest:
>>>> boost
>>>> 2015-04-07 09:38:11,237 INFO solr.SolrMappingReader - source: digest
>>> dest:
>>>> digest
>>>> 2015-04-07 09:38:11,237 INFO solr.SolrMappingReader - source: tstamp
>>> dest:
>>>> tstamp
>>>> 2015-04-07 09:38:11,526 INFO solr.SolrIndexWriter - Indexing 250
>>> documents
>>>> 2015-04-07 09:38:11,526 INFO solr.SolrIndexWriter - Deleting 0
>> documents
>>>> 2015-04-07 09:38:11,644 INFO solr.SolrIndexWriter - Indexing 250
>>> documents
>>>> *2015-04-07 09:38:11,699 WARN mapred.LocalJobRunner -
>>>> job_local1245074757_0001*
>>>> *org.apache.solr.common.SolrException: Not Found*
>>>>
>>>> *Not Found*
>>>>
>>>> *request: http://localhost:8983/solr/update?wt=javabin&version=2
>>>> <http://localhost:8983/solr/update?wt=javabin&version=2>*
>>>> * at
>>>> org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.
>>> request(CommonsHttpSolrServer.java:430)*
>>>> * at
>>>> org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.
>>> request(CommonsHttpSolrServer.java:244)*
>>>> * at
>>>> org.apache.solr.client.solrj.request.AbstractUpdateRequest.
>>> process(AbstractUpdateRequest.java:105)*
>>>> * at
>>>> org.apache.nutch.indexwriter.solr.SolrIndexWriter.write(
>>> SolrIndexWriter.java:135)*
>>>> * at org.apache.nutch.indexer.IndexWriters.write(IndexWriters.java:88)*
>>>> * at
>>>> org.apache.nutch.indexer.IndexerOutputFormat$1.write(
>>> IndexerOutputFormat.java:50)*
>>>> * at
>>>> org.apache.nutch.indexer.IndexerOutputFormat$1.write(
>>> IndexerOutputFormat.java:41)*
>>>> * at
>>>> org.apache.hadoop.mapred.ReduceTask$OldTrackingRecordWriter.write(
>>> ReduceTask.java:458)*
>>>> * at
>> org.apache.hadoop.mapred.ReduceTask$3.collect(ReduceTask.java:500)*
>>>> * at
>>>> org.apache.nutch.indexer.IndexerMapReduce.reduce(
>>> IndexerMapReduce.java:323)*
>>>> * at
>>>> org.apache.nutch.indexer.IndexerMapReduce.reduce(
>>> IndexerMapReduce.java:53)*
>>>> * at org.apache.hadoop.mapred.ReduceTask.runOldReducer(
>>> ReduceTask.java:522)*
>>>> * at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:421)*
>>>> * at
>>>> org.apache.hadoop.mapred.LocalJobRunner$Job.run(
>>> LocalJobRunner.java:398)*
>>>> *2015-04-07 09:38:12,408 ERROR indexer.IndexingJob - Indexer:
>>>> java.io.IOException: Job failed!*
>>>> * at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1357)*
>>>> * at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:114)*
>>>> * at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:176)*
>>>> * at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)*
>>>> * at org.apache.nutch.indexer.IndexingJob.main(IndexingJob.java:186)*
>>>>> On Tue, 7 Apr 2015 at 03:18 Jeff Cocking <je...@gmail.com>
>>> wrote:
>>>>> With Solr5.0.0 you can skip that step. Solr will auto create your
>>> schema
>>>>> document based on the data being provided.
>>>>>
>>>>> One of the new features with Solr5 is the install/service feature. I
>>> did a
>>>>> quick write up on how to install Solr5 on Centos. Might be something
>>>>> useful there for you.
>>>>>
>>>>> http://www.cocking.com/apache-solr-5-0-install-on-centos-7/
>>>>>
>>>>> jeff
>>>>>
>>>>> On Mon, Apr 6, 2015 at 3:13 PM, Anchit Jain <anchitjain1234@gmail.com
>>>>> wrote:
>>>>>
>>>>>> I want to index nutch results using *Solr 5.0* but as mentioned in
>>>>>> https://wiki.apache.org/nutch/NutchTutorial there is no directory
>>>>>> ${APACHE_SOLR_HOME}/example/solr/collection1/conf/
>>>>>> in solr 5.0 . So where I have to copy *schema.xml*?
>>>>>> Also there is no *start.jar* present in example directory.
Re: Nutch 1.9 integration with Solr 5.0.0
Posted by Jeff Cocking <je...@gmail.com>.
The command you are using is not pointing to the specific Solr core you
created. The http://localhost:8983/solr URL needs to be changed to the URL of
the core you created. It should look like
http://localhost:8983/solr/#/new_core_name.
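[Editorial note: one caveat on the URL form above. The `#/new_core_name` variant is the admin-console route; because a URL fragment is never sent to the server, the endpoint that clients actually POST to omits the `#`. A quick sketch of the endpoint construction, with `new_core_name` as an illustrative core name:]

```shell
# Illustrative core name; substitute the core you actually created.
core="new_core_name"
update_url="http://localhost:8983/solr/${core}/update"
echo "$update_url"
# http://localhost:8983/solr/new_core_name/update
# If Solr is running, the endpoint can be probed before rerunning the
# indexer, e.g.:
#   curl -s "${update_url}?commit=true" -H 'Content-Type: text/xml' -d '<add/>'
```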
On Tue, Apr 7, 2015 at 2:33 AM, Anchit Jain <an...@gmail.com>
wrote:
> I followed instructions as given on your blog and created a new core for
> nutch data and copied schema.xml of nutch into it.
> Then I run the following command in nutch working directory
> bin/nutch solrindex http://localhost:8983/solr crawl/crawldb/ -linkdb
> crawl/linkdb/ crawl/segments/20150406231502/ -filter -normalize
>
> But then also the same error is coming as like previous runs.
>
> On Tue, 7 Apr 2015 at 10:00 Jeff Cocking <je...@gmail.com> wrote:
>
> > Solr5 is multicore by default. You have not finished the install by
> > setting up solr5's core. I would suggest you look at the link I sent to
> > finish up your setup.
> >
> > After you finish your install your solr URL will be
> > http://localhost:8983/solr/#/core_name.
> >
> > Jeff Cocking
> >
> > I apologize for my brevity.
> > This was sent from my mobile device while I should be focusing on
> > something else.....
> > Like a meeting, driving, family, etc.
> >
> > > On Apr 6, 2015, at 11:16 PM, Anchit Jain <an...@gmail.com>
> > wrote:
> > >
> > > I have already installed Solr.I want to integrate it with nutch.
> > > Whenever I try to issue this command to nutch
> > > ""bin/nutch solrindex http://localhost:8983/solr crawl/crawldb/
> -linkdb
> > > crawl/linkdb/ crawl/segments/20150406231502/ -filter -normalize"
> > >
> > > I always get a error
> > >
> > > Indexer: java.io.IOException: Job failed!
> > > at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1357)
> > > at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:114)
> > > at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:176)
> > > at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
> > > at org.apache.nutch.indexer.IndexingJob.main(IndexingJob.java:186)
> > >
> > >
> > >
> > > Here is the complete hadoop log for the process.I have underlined the
> > error
> > > part in it.
> > >
> > > 2015-04-07 09:38:06,613 INFO indexer.IndexingJob - Indexer: starting
> at
> > > 2015-04-07 09:38:06
> > > 2015-04-07 09:38:06,684 INFO indexer.IndexingJob - Indexer: deleting
> > gone
> > > documents: false
> > > 2015-04-07 09:38:06,685 INFO indexer.IndexingJob - Indexer: URL
> > filtering:
> > > true
> > > 2015-04-07 09:38:06,685 INFO indexer.IndexingJob - Indexer: URL
> > > normalizing: true
> > > 2015-04-07 09:38:06,893 INFO indexer.IndexWriters - Adding
> > > org.apache.nutch.indexwriter.solr.SolrIndexWriter
> > > 2015-04-07 09:38:06,893 INFO indexer.IndexingJob - Active
> IndexWriters :
> > > SOLRIndexWriter
> > > solr.server.url : URL of the SOLR instance (mandatory)
> > > solr.commit.size : buffer size when sending to SOLR (default 1000)
> > > solr.mapping.file : name of the mapping file for fields (default
> > > solrindex-mapping.xml)
> > > solr.auth : use authentication (default false)
> > > solr.auth.username : use authentication (default false)
> > > solr.auth : username for authentication
> > > solr.auth.password : password for authentication
> > >
> > >
> > > 2015-04-07 09:38:06,898 INFO indexer.IndexerMapReduce -
> > IndexerMapReduce:
> > > crawldb: crawl/crawldb
> > > 2015-04-07 09:38:06,898 INFO indexer.IndexerMapReduce -
> > IndexerMapReduce:
> > > linkdb: crawl/linkdb
> > > 2015-04-07 09:38:06,898 INFO indexer.IndexerMapReduce -
> > IndexerMapReduces:
> > > adding segment: crawl/segments/20150406231502
> > > 2015-04-07 09:38:07,036 WARN util.NativeCodeLoader - Unable to load
> > > native-hadoop library for your platform... using builtin-java classes
> > where
> > > applicable
> > > 2015-04-07 09:38:07,540 INFO anchor.AnchorIndexingFilter - Anchor
> > > deduplication is: off
> > > 2015-04-07 09:38:07,565 INFO regex.RegexURLNormalizer - can't find
> rules
> > > for scope 'indexer', using default
> > > 2015-04-07 09:38:09,552 INFO regex.RegexURLNormalizer - can't find
> rules
> > > for scope 'indexer', using default
> > > 2015-04-07 09:38:10,642 INFO regex.RegexURLNormalizer - can't find
> rules
> > > for scope 'indexer', using default
> > > 2015-04-07 09:38:10,734 INFO regex.RegexURLNormalizer - can't find
> rules
> > > for scope 'indexer', using default
> > > 2015-04-07 09:38:10,895 INFO regex.RegexURLNormalizer - can't find
> rules
> > > for scope 'indexer', using default
> > > 2015-04-07 09:38:11,088 INFO regex.RegexURLNormalizer - can't find
> rules
> > > for scope 'indexer', using default
> > > 2015-04-07 09:38:11,219 INFO indexer.IndexWriters - Adding
> > > org.apache.nutch.indexwriter.solr.SolrIndexWriter
> > > 2015-04-07 09:38:11,237 INFO solr.SolrMappingReader - source: content
> > > dest: content
> > > 2015-04-07 09:38:11,237 INFO solr.SolrMappingReader - source: title
> > dest:
> > > title
> > > 2015-04-07 09:38:11,237 INFO solr.SolrMappingReader - source: host
> dest:
> > > host
> > > 2015-04-07 09:38:11,237 INFO solr.SolrMappingReader - source: segment
> > > dest: segment
> > > 2015-04-07 09:38:11,237 INFO solr.SolrMappingReader - source: boost
> > dest:
> > > boost
> > > 2015-04-07 09:38:11,237 INFO solr.SolrMappingReader - source: digest
> > dest:
> > > digest
> > > 2015-04-07 09:38:11,237 INFO solr.SolrMappingReader - source: tstamp
> > dest:
> > > tstamp
> > > 2015-04-07 09:38:11,526 INFO solr.SolrIndexWriter - Indexing 250
> > documents
> > > 2015-04-07 09:38:11,526 INFO solr.SolrIndexWriter - Deleting 0
> documents
> > > 2015-04-07 09:38:11,644 INFO solr.SolrIndexWriter - Indexing 250
> > documents
> > > *2015-04-07 09:38:11,699 WARN mapred.LocalJobRunner -
> > > job_local1245074757_0001*
> > > *org.apache.solr.common.SolrException: Not Found*
> > >
> > > *Not Found*
> > >
> > > *request: http://localhost:8983/solr/update?wt=javabin&version=2
> > > <http://localhost:8983/solr/update?wt=javabin&version=2>*
> > > * at
> > > org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.
> > request(CommonsHttpSolrServer.java:430)*
> > > * at
> > > org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.
> > request(CommonsHttpSolrServer.java:244)*
> > > * at
> > > org.apache.solr.client.solrj.request.AbstractUpdateRequest.
> > process(AbstractUpdateRequest.java:105)*
> > > * at
> > > org.apache.nutch.indexwriter.solr.SolrIndexWriter.write(
> > SolrIndexWriter.java:135)*
> > > * at org.apache.nutch.indexer.IndexWriters.write(IndexWriters.java:88)*
> > > * at
> > > org.apache.nutch.indexer.IndexerOutputFormat$1.write(
> > IndexerOutputFormat.java:50)*
> > > * at
> > > org.apache.nutch.indexer.IndexerOutputFormat$1.write(
> > IndexerOutputFormat.java:41)*
> > > * at
> > > org.apache.hadoop.mapred.ReduceTask$OldTrackingRecordWriter.write(
> > ReduceTask.java:458)*
> > > * at
> org.apache.hadoop.mapred.ReduceTask$3.collect(ReduceTask.java:500)*
> > > * at
> > > org.apache.nutch.indexer.IndexerMapReduce.reduce(
> > IndexerMapReduce.java:323)*
> > > * at
> > > org.apache.nutch.indexer.IndexerMapReduce.reduce(
> > IndexerMapReduce.java:53)*
> > > * at org.apache.hadoop.mapred.ReduceTask.runOldReducer(
> > ReduceTask.java:522)*
> > > * at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:421)*
> > > * at
> > > org.apache.hadoop.mapred.LocalJobRunner$Job.run(
> > LocalJobRunner.java:398)*
> > > *2015-04-07 09:38:12,408 ERROR indexer.IndexingJob - Indexer:
> > > java.io.IOException: Job failed!*
> > > * at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1357)*
> > > * at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:114)*
> > > * at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:176)*
> > > * at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)*
> > > * at org.apache.nutch.indexer.IndexingJob.main(IndexingJob.java:186)*
> > >> On Tue, 7 Apr 2015 at 03:18 Jeff Cocking <je...@gmail.com>
> > wrote:
> > >>
> > >> With Solr5.0.0 you can skip that step. Solr will auto create your
> > schema
> > >> document based on the data being provided.
> > >>
> > >> One of the new features with Solr5 is the install/service feature. I
> > did a
> > >> quick write up on how to install Solr5 on Centos. Might be something
> > >> useful there for you.
> > >>
> > >> http://www.cocking.com/apache-solr-5-0-install-on-centos-7/
> > >>
> > >> jeff
> > >>
> > >> On Mon, Apr 6, 2015 at 3:13 PM, Anchit Jain <anchitjain1234@gmail.com
> >
> > >> wrote:
> > >>
> > >>> I want to index nutch results using *Solr 5.0* but as mentioned in
> > >>> https://wiki.apache.org/nutch/NutchTutorial there is no directory
> > >>> ${APACHE_SOLR_HOME}/example/solr/collection1/conf/
> > >>> in solr 5.0 . So where I have to copy *schema.xml*?
> > >>> Also there is no *start.jar* present in example directory.
> > >>
> >
>
Re: Nutch 1.9 integration with Solr 5.0.0
Posted by Anchit Jain <an...@gmail.com>.
I followed the instructions on your blog, created a new core for the nutch
data, and copied nutch's schema.xml into it.
Then I ran the following command in the nutch working directory:
bin/nutch solrindex http://localhost:8983/solr crawl/crawldb/ -linkdb
crawl/linkdb/ crawl/segments/20150406231502/ -filter -normalize
But the same error occurs as in the previous runs.
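The "Not Found" in the log comes from Nutch posting to http://localhost:8983/solr/update, a URL with no core in the path; Solr 5 ships no default collection1 core, so the update handler only exists under a per-core path. A minimal sketch of the fix, assuming a core named "nutch" (a hypothetical name; use whatever core the blog steps created):

```shell
#!/bin/sh
# The stack trace shows Nutch posting to $SOLR_BASE/update, which 404s in
# Solr 5 because the update handler lives under a core. Hand bin/nutch the
# per-core URL instead. "nutch" is a hypothetical core name.
SOLR_BASE="http://localhost:8983/solr"
CORE="nutch"
echo "$SOLR_BASE/$CORE"
# Re-run indexing against the per-core URL:
#   bin/nutch solrindex "$SOLR_BASE/$CORE" crawl/crawldb/ -linkdb crawl/linkdb/ \
#       crawl/segments/20150406231502/ -filter -normalize
# Sanity-check that the core answers before re-running Nutch:
#   curl "$SOLR_BASE/$CORE/admin/ping"
```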
On Tue, 7 Apr 2015 at 10:00 Jeff Cocking <je...@gmail.com> wrote:
> Solr5 is multicore by default. You have not finished the install by
> setting up solr5's core. I would suggest you look at the link I sent to
> finish up your setup.
>
> After you finish your install your solr URL will be
> http://localhost:8983/solr/#/core_name.
>
> Jeff Cocking
>
> I apologize for my brevity.
> This was sent from my mobile device while I should be focusing on
> something else.....
> Like a meeting, driving, family, etc.
>
> > On Apr 6, 2015, at 11:16 PM, Anchit Jain <an...@gmail.com>
> wrote:
> >
> > I have already installed Solr.I want to integrate it with nutch.
> > Whenever I try to issue this command to nutch
> > ""bin/nutch solrindex http://localhost:8983/solr crawl/crawldb/ -linkdb
> > crawl/linkdb/ crawl/segments/20150406231502/ -filter -normalize"
> >
> > I always get a error
> >
> > Indexer: java.io.IOException: Job failed!
> > at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1357)
> > at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:114)
> > at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:176)
> > at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
> > at org.apache.nutch.indexer.IndexingJob.main(IndexingJob.java:186)
> >
> >
> >
> > Here is the complete hadoop log for the process.I have underlined the
> error
> > part in it.
> >
> > 2015-04-07 09:38:06,613 INFO indexer.IndexingJob - Indexer: starting at
> > 2015-04-07 09:38:06
> > 2015-04-07 09:38:06,684 INFO indexer.IndexingJob - Indexer: deleting
> gone
> > documents: false
> > 2015-04-07 09:38:06,685 INFO indexer.IndexingJob - Indexer: URL
> filtering:
> > true
> > 2015-04-07 09:38:06,685 INFO indexer.IndexingJob - Indexer: URL
> > normalizing: true
> > 2015-04-07 09:38:06,893 INFO indexer.IndexWriters - Adding
> > org.apache.nutch.indexwriter.solr.SolrIndexWriter
> > 2015-04-07 09:38:06,893 INFO indexer.IndexingJob - Active IndexWriters :
> > SOLRIndexWriter
> > solr.server.url : URL of the SOLR instance (mandatory)
> > solr.commit.size : buffer size when sending to SOLR (default 1000)
> > solr.mapping.file : name of the mapping file for fields (default
> > solrindex-mapping.xml)
> > solr.auth : use authentication (default false)
> > solr.auth.username : use authentication (default false)
> > solr.auth : username for authentication
> > solr.auth.password : password for authentication
> >
> >
> > 2015-04-07 09:38:06,898 INFO indexer.IndexerMapReduce -
> IndexerMapReduce:
> > crawldb: crawl/crawldb
> > 2015-04-07 09:38:06,898 INFO indexer.IndexerMapReduce -
> IndexerMapReduce:
> > linkdb: crawl/linkdb
> > 2015-04-07 09:38:06,898 INFO indexer.IndexerMapReduce -
> IndexerMapReduces:
> > adding segment: crawl/segments/20150406231502
> > 2015-04-07 09:38:07,036 WARN util.NativeCodeLoader - Unable to load
> > native-hadoop library for your platform... using builtin-java classes
> where
> > applicable
> > 2015-04-07 09:38:07,540 INFO anchor.AnchorIndexingFilter - Anchor
> > deduplication is: off
> > 2015-04-07 09:38:07,565 INFO regex.RegexURLNormalizer - can't find rules
> > for scope 'indexer', using default
> > 2015-04-07 09:38:09,552 INFO regex.RegexURLNormalizer - can't find rules
> > for scope 'indexer', using default
> > 2015-04-07 09:38:10,642 INFO regex.RegexURLNormalizer - can't find rules
> > for scope 'indexer', using default
> > 2015-04-07 09:38:10,734 INFO regex.RegexURLNormalizer - can't find rules
> > for scope 'indexer', using default
> > 2015-04-07 09:38:10,895 INFO regex.RegexURLNormalizer - can't find rules
> > for scope 'indexer', using default
> > 2015-04-07 09:38:11,088 INFO regex.RegexURLNormalizer - can't find rules
> > for scope 'indexer', using default
> > 2015-04-07 09:38:11,219 INFO indexer.IndexWriters - Adding
> > org.apache.nutch.indexwriter.solr.SolrIndexWriter
> > 2015-04-07 09:38:11,237 INFO solr.SolrMappingReader - source: content
> > dest: content
> > 2015-04-07 09:38:11,237 INFO solr.SolrMappingReader - source: title
> dest:
> > title
> > 2015-04-07 09:38:11,237 INFO solr.SolrMappingReader - source: host dest:
> > host
> > 2015-04-07 09:38:11,237 INFO solr.SolrMappingReader - source: segment
> > dest: segment
> > 2015-04-07 09:38:11,237 INFO solr.SolrMappingReader - source: boost
> dest:
> > boost
> > 2015-04-07 09:38:11,237 INFO solr.SolrMappingReader - source: digest
> dest:
> > digest
> > 2015-04-07 09:38:11,237 INFO solr.SolrMappingReader - source: tstamp
> dest:
> > tstamp
> > 2015-04-07 09:38:11,526 INFO solr.SolrIndexWriter - Indexing 250
> documents
> > 2015-04-07 09:38:11,526 INFO solr.SolrIndexWriter - Deleting 0 documents
> > 2015-04-07 09:38:11,644 INFO solr.SolrIndexWriter - Indexing 250
> documents
> > *2015-04-07 09:38:11,699 WARN mapred.LocalJobRunner -
> > job_local1245074757_0001*
> > *org.apache.solr.common.SolrException: Not Found*
> >
> > *Not Found*
> >
> > *request: http://localhost:8983/solr/update?wt=javabin&version=2
> > <http://localhost:8983/solr/update?wt=javabin&version=2>*
> > * at
> > org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.
> request(CommonsHttpSolrServer.java:430)*
> > * at
> > org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.
> request(CommonsHttpSolrServer.java:244)*
> > * at
> > org.apache.solr.client.solrj.request.AbstractUpdateRequest.
> process(AbstractUpdateRequest.java:105)*
> > * at
> > org.apache.nutch.indexwriter.solr.SolrIndexWriter.write(
> SolrIndexWriter.java:135)*
> > * at org.apache.nutch.indexer.IndexWriters.write(IndexWriters.java:88)*
> > * at
> > org.apache.nutch.indexer.IndexerOutputFormat$1.write(
> IndexerOutputFormat.java:50)*
> > * at
> > org.apache.nutch.indexer.IndexerOutputFormat$1.write(
> IndexerOutputFormat.java:41)*
> > * at
> > org.apache.hadoop.mapred.ReduceTask$OldTrackingRecordWriter.write(
> ReduceTask.java:458)*
> > * at org.apache.hadoop.mapred.ReduceTask$3.collect(ReduceTask.java:500)*
> > * at
> > org.apache.nutch.indexer.IndexerMapReduce.reduce(
> IndexerMapReduce.java:323)*
> > * at
> > org.apache.nutch.indexer.IndexerMapReduce.reduce(
> IndexerMapReduce.java:53)*
> > * at org.apache.hadoop.mapred.ReduceTask.runOldReducer(
> ReduceTask.java:522)*
> > * at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:421)*
> > * at
> > org.apache.hadoop.mapred.LocalJobRunner$Job.run(
> LocalJobRunner.java:398)*
> > *2015-04-07 09:38:12,408 ERROR indexer.IndexingJob - Indexer:
> > java.io.IOException: Job failed!*
> > * at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1357)*
> > * at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:114)*
> > * at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:176)*
> > * at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)*
> > * at org.apache.nutch.indexer.IndexingJob.main(IndexingJob.java:186)*
> >> On Tue, 7 Apr 2015 at 03:18 Jeff Cocking <je...@gmail.com>
> wrote:
> >>
> >> With Solr5.0.0 you can skip that step. Solr will auto create your
> schema
> >> document based on the data being provided.
> >>
> >> One of the new features with Solr5 is the install/service feature. I
> did a
> >> quick write up on how to install Solr5 on Centos. Might be something
> >> useful there for you.
> >>
> >> http://www.cocking.com/apache-solr-5-0-install-on-centos-7/
> >>
> >> jeff
> >>
> >> On Mon, Apr 6, 2015 at 3:13 PM, Anchit Jain <an...@gmail.com>
> >> wrote:
> >>
> >>> I want to index nutch results using *Solr 5.0* but as mentioned in
> >>> https://wiki.apache.org/nutch/NutchTutorial there is no directory
> >>> ${APACHE_SOLR_HOME}/example/solr/collection1/conf/
> >>> in solr 5.0 . So where I have to copy *schema.xml*?
> >>> Also there is no *start.jar* present in example directory.
> >>
>
Re: Nutch 1.9 integration with Solr 5.0.0
Posted by Jeff Cocking <je...@gmail.com>.
Solr5 is multicore by default. You have not finished the install: you still need to set up Solr5's core. I would suggest you look at the link I sent to finish up your setup.
After you finish your install your solr URL will be http://localhost:8983/solr/#/core_name.
Jeff Cocking
I apologize for my brevity.
This was sent from my mobile device while I should be focusing on something else.....
Like a meeting, driving, family, etc.
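One caveat on the URL above: the `#/core_name` form is the browser-side admin UI, and the fragment after `#` never reaches the server, so HTTP clients such as Nutch need the plain per-core path. A sketch of the distinction, assuming a hypothetical core named "nutch":

```shell
#!/bin/sh
# The "#" URL is only meaningful inside the Solr admin web UI; clients
# (bin/nutch, curl, SolrJ) must use the path without the fragment.
# "nutch" is a hypothetical core name.
CORE="nutch"
ADMIN_UI="http://localhost:8983/solr/#/$CORE"    # for humans, in a browser
CLIENT_URL="http://localhost:8983/solr/$CORE"    # for bin/nutch solrindex
echo "$CLIENT_URL"
# Confirm the core is registered (requires a running Solr):
#   curl "http://localhost:8983/solr/admin/cores?action=STATUS&core=$CORE"
```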
> On Apr 6, 2015, at 11:16 PM, Anchit Jain <an...@gmail.com> wrote:
>
> I have already installed Solr.I want to integrate it with nutch.
> Whenever I try to issue this command to nutch
> ""bin/nutch solrindex http://localhost:8983/solr crawl/crawldb/ -linkdb
> crawl/linkdb/ crawl/segments/20150406231502/ -filter -normalize"
>
> I always get a error
>
> Indexer: java.io.IOException: Job failed!
> at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1357)
> at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:114)
> at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:176)
> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
> at org.apache.nutch.indexer.IndexingJob.main(IndexingJob.java:186)
>
>
>
> Here is the complete hadoop log for the process.I have underlined the error
> part in it.
>
> 2015-04-07 09:38:06,613 INFO indexer.IndexingJob - Indexer: starting at
> 2015-04-07 09:38:06
> 2015-04-07 09:38:06,684 INFO indexer.IndexingJob - Indexer: deleting gone
> documents: false
> 2015-04-07 09:38:06,685 INFO indexer.IndexingJob - Indexer: URL filtering:
> true
> 2015-04-07 09:38:06,685 INFO indexer.IndexingJob - Indexer: URL
> normalizing: true
> 2015-04-07 09:38:06,893 INFO indexer.IndexWriters - Adding
> org.apache.nutch.indexwriter.solr.SolrIndexWriter
> 2015-04-07 09:38:06,893 INFO indexer.IndexingJob - Active IndexWriters :
> SOLRIndexWriter
> solr.server.url : URL of the SOLR instance (mandatory)
> solr.commit.size : buffer size when sending to SOLR (default 1000)
> solr.mapping.file : name of the mapping file for fields (default
> solrindex-mapping.xml)
> solr.auth : use authentication (default false)
> solr.auth.username : use authentication (default false)
> solr.auth : username for authentication
> solr.auth.password : password for authentication
>
>
> 2015-04-07 09:38:06,898 INFO indexer.IndexerMapReduce - IndexerMapReduce:
> crawldb: crawl/crawldb
> 2015-04-07 09:38:06,898 INFO indexer.IndexerMapReduce - IndexerMapReduce:
> linkdb: crawl/linkdb
> 2015-04-07 09:38:06,898 INFO indexer.IndexerMapReduce - IndexerMapReduces:
> adding segment: crawl/segments/20150406231502
> 2015-04-07 09:38:07,036 WARN util.NativeCodeLoader - Unable to load
> native-hadoop library for your platform... using builtin-java classes where
> applicable
> 2015-04-07 09:38:07,540 INFO anchor.AnchorIndexingFilter - Anchor
> deduplication is: off
> 2015-04-07 09:38:07,565 INFO regex.RegexURLNormalizer - can't find rules
> for scope 'indexer', using default
> 2015-04-07 09:38:09,552 INFO regex.RegexURLNormalizer - can't find rules
> for scope 'indexer', using default
> 2015-04-07 09:38:10,642 INFO regex.RegexURLNormalizer - can't find rules
> for scope 'indexer', using default
> 2015-04-07 09:38:10,734 INFO regex.RegexURLNormalizer - can't find rules
> for scope 'indexer', using default
> 2015-04-07 09:38:10,895 INFO regex.RegexURLNormalizer - can't find rules
> for scope 'indexer', using default
> 2015-04-07 09:38:11,088 INFO regex.RegexURLNormalizer - can't find rules
> for scope 'indexer', using default
> 2015-04-07 09:38:11,219 INFO indexer.IndexWriters - Adding
> org.apache.nutch.indexwriter.solr.SolrIndexWriter
> 2015-04-07 09:38:11,237 INFO solr.SolrMappingReader - source: content
> dest: content
> 2015-04-07 09:38:11,237 INFO solr.SolrMappingReader - source: title dest:
> title
> 2015-04-07 09:38:11,237 INFO solr.SolrMappingReader - source: host dest:
> host
> 2015-04-07 09:38:11,237 INFO solr.SolrMappingReader - source: segment
> dest: segment
> 2015-04-07 09:38:11,237 INFO solr.SolrMappingReader - source: boost dest:
> boost
> 2015-04-07 09:38:11,237 INFO solr.SolrMappingReader - source: digest dest:
> digest
> 2015-04-07 09:38:11,237 INFO solr.SolrMappingReader - source: tstamp dest:
> tstamp
> 2015-04-07 09:38:11,526 INFO solr.SolrIndexWriter - Indexing 250 documents
> 2015-04-07 09:38:11,526 INFO solr.SolrIndexWriter - Deleting 0 documents
> 2015-04-07 09:38:11,644 INFO solr.SolrIndexWriter - Indexing 250 documents
> *2015-04-07 09:38:11,699 WARN mapred.LocalJobRunner -
> job_local1245074757_0001*
> *org.apache.solr.common.SolrException: Not Found*
>
> *Not Found*
>
> *request: http://localhost:8983/solr/update?wt=javabin&version=2
> <http://localhost:8983/solr/update?wt=javabin&version=2>*
> * at
> org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:430)*
> * at
> org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:244)*
> * at
> org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:105)*
> * at
> org.apache.nutch.indexwriter.solr.SolrIndexWriter.write(SolrIndexWriter.java:135)*
> * at org.apache.nutch.indexer.IndexWriters.write(IndexWriters.java:88)*
> * at
> org.apache.nutch.indexer.IndexerOutputFormat$1.write(IndexerOutputFormat.java:50)*
> * at
> org.apache.nutch.indexer.IndexerOutputFormat$1.write(IndexerOutputFormat.java:41)*
> * at
> org.apache.hadoop.mapred.ReduceTask$OldTrackingRecordWriter.write(ReduceTask.java:458)*
> * at org.apache.hadoop.mapred.ReduceTask$3.collect(ReduceTask.java:500)*
> * at
> org.apache.nutch.indexer.IndexerMapReduce.reduce(IndexerMapReduce.java:323)*
> * at
> org.apache.nutch.indexer.IndexerMapReduce.reduce(IndexerMapReduce.java:53)*
> * at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:522)*
> * at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:421)*
> * at
> org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:398)*
> *2015-04-07 09:38:12,408 ERROR indexer.IndexingJob - Indexer:
> java.io.IOException: Job failed!*
> * at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1357)*
> * at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:114)*
> * at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:176)*
> * at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)*
> * at org.apache.nutch.indexer.IndexingJob.main(IndexingJob.java:186)*
>> On Tue, 7 Apr 2015 at 03:18 Jeff Cocking <je...@gmail.com> wrote:
>>
>> With Solr5.0.0 you can skip that step. Solr will auto create your schema
>> document based on the data being provided.
>>
>> One of the new features with Solr5 is the install/service feature. I did a
>> quick write up on how to install Solr5 on Centos. Might be something
>> useful there for you.
>>
>> http://www.cocking.com/apache-solr-5-0-install-on-centos-7/
>>
>> jeff
>>
>> On Mon, Apr 6, 2015 at 3:13 PM, Anchit Jain <an...@gmail.com>
>> wrote:
>>
>>> I want to index nutch results using *Solr 5.0* but as mentioned in
>>> https://wiki.apache.org/nutch/NutchTutorial there is no directory
>>> ${APACHE_SOLR_HOME}/example/solr/collection1/conf/
>>> in solr 5.0 . So where I have to copy *schema.xml*?
>>> Also there is no *start.jar* present in example directory.
>>
Re: Nutch 1.9 integration with Solr 5.0.0
Posted by Anchit Jain <an...@gmail.com>.
I have already installed Solr; I want to integrate it with nutch.
Whenever I issue this command to nutch
"bin/nutch solrindex http://localhost:8983/solr crawl/crawldb/ -linkdb
crawl/linkdb/ crawl/segments/20150406231502/ -filter -normalize"
I always get an error:
Indexer: java.io.IOException: Job failed!
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1357)
at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:114)
at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:176)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at org.apache.nutch.indexer.IndexingJob.main(IndexingJob.java:186)
Here is the complete hadoop log for the process. I have underlined the error
part in it.
2015-04-07 09:38:06,613 INFO indexer.IndexingJob - Indexer: starting at
2015-04-07 09:38:06
2015-04-07 09:38:06,684 INFO indexer.IndexingJob - Indexer: deleting gone
documents: false
2015-04-07 09:38:06,685 INFO indexer.IndexingJob - Indexer: URL filtering:
true
2015-04-07 09:38:06,685 INFO indexer.IndexingJob - Indexer: URL
normalizing: true
2015-04-07 09:38:06,893 INFO indexer.IndexWriters - Adding
org.apache.nutch.indexwriter.solr.SolrIndexWriter
2015-04-07 09:38:06,893 INFO indexer.IndexingJob - Active IndexWriters :
SOLRIndexWriter
solr.server.url : URL of the SOLR instance (mandatory)
solr.commit.size : buffer size when sending to SOLR (default 1000)
solr.mapping.file : name of the mapping file for fields (default
solrindex-mapping.xml)
solr.auth : use authentication (default false)
solr.auth.username : use authentication (default false)
solr.auth : username for authentication
solr.auth.password : password for authentication
2015-04-07 09:38:06,898 INFO indexer.IndexerMapReduce - IndexerMapReduce:
crawldb: crawl/crawldb
2015-04-07 09:38:06,898 INFO indexer.IndexerMapReduce - IndexerMapReduce:
linkdb: crawl/linkdb
2015-04-07 09:38:06,898 INFO indexer.IndexerMapReduce - IndexerMapReduces:
adding segment: crawl/segments/20150406231502
2015-04-07 09:38:07,036 WARN util.NativeCodeLoader - Unable to load
native-hadoop library for your platform... using builtin-java classes where
applicable
2015-04-07 09:38:07,540 INFO anchor.AnchorIndexingFilter - Anchor
deduplication is: off
2015-04-07 09:38:07,565 INFO regex.RegexURLNormalizer - can't find rules
for scope 'indexer', using default
2015-04-07 09:38:09,552 INFO regex.RegexURLNormalizer - can't find rules
for scope 'indexer', using default
2015-04-07 09:38:10,642 INFO regex.RegexURLNormalizer - can't find rules
for scope 'indexer', using default
2015-04-07 09:38:10,734 INFO regex.RegexURLNormalizer - can't find rules
for scope 'indexer', using default
2015-04-07 09:38:10,895 INFO regex.RegexURLNormalizer - can't find rules
for scope 'indexer', using default
2015-04-07 09:38:11,088 INFO regex.RegexURLNormalizer - can't find rules
for scope 'indexer', using default
2015-04-07 09:38:11,219 INFO indexer.IndexWriters - Adding
org.apache.nutch.indexwriter.solr.SolrIndexWriter
2015-04-07 09:38:11,237 INFO solr.SolrMappingReader - source: content
dest: content
2015-04-07 09:38:11,237 INFO solr.SolrMappingReader - source: title dest:
title
2015-04-07 09:38:11,237 INFO solr.SolrMappingReader - source: host dest:
host
2015-04-07 09:38:11,237 INFO solr.SolrMappingReader - source: segment
dest: segment
2015-04-07 09:38:11,237 INFO solr.SolrMappingReader - source: boost dest:
boost
2015-04-07 09:38:11,237 INFO solr.SolrMappingReader - source: digest dest:
digest
2015-04-07 09:38:11,237 INFO solr.SolrMappingReader - source: tstamp dest:
tstamp
2015-04-07 09:38:11,526 INFO solr.SolrIndexWriter - Indexing 250 documents
2015-04-07 09:38:11,526 INFO solr.SolrIndexWriter - Deleting 0 documents
2015-04-07 09:38:11,644 INFO solr.SolrIndexWriter - Indexing 250 documents
*2015-04-07 09:38:11,699 WARN mapred.LocalJobRunner -
job_local1245074757_0001*
*org.apache.solr.common.SolrException: Not Found*
*Not Found*
*request: http://localhost:8983/solr/update?wt=javabin&version=2
<http://localhost:8983/solr/update?wt=javabin&version=2>*
* at
org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:430)*
* at
org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:244)*
* at
org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:105)*
* at
org.apache.nutch.indexwriter.solr.SolrIndexWriter.write(SolrIndexWriter.java:135)*
* at org.apache.nutch.indexer.IndexWriters.write(IndexWriters.java:88)*
* at
org.apache.nutch.indexer.IndexerOutputFormat$1.write(IndexerOutputFormat.java:50)*
* at
org.apache.nutch.indexer.IndexerOutputFormat$1.write(IndexerOutputFormat.java:41)*
* at
org.apache.hadoop.mapred.ReduceTask$OldTrackingRecordWriter.write(ReduceTask.java:458)*
* at org.apache.hadoop.mapred.ReduceTask$3.collect(ReduceTask.java:500)*
* at
org.apache.nutch.indexer.IndexerMapReduce.reduce(IndexerMapReduce.java:323)*
* at
org.apache.nutch.indexer.IndexerMapReduce.reduce(IndexerMapReduce.java:53)*
* at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:522)*
* at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:421)*
* at
org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:398)*
*2015-04-07 09:38:12,408 ERROR indexer.IndexingJob - Indexer:
java.io.IOException: Job failed!*
* at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1357)*
* at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:114)*
* at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:176)*
* at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)*
* at org.apache.nutch.indexer.IndexingJob.main(IndexingJob.java:186)*
On Tue, 7 Apr 2015 at 03:18 Jeff Cocking <je...@gmail.com> wrote:
> With Solr5.0.0 you can skip that step. Solr will auto create your schema
> document based on the data being provided.
>
> One of the new features with Solr5 is the install/service feature. I did a
> quick write up on how to install Solr5 on Centos. Might be something
> useful there for you.
>
> http://www.cocking.com/apache-solr-5-0-install-on-centos-7/
>
> jeff
>
> On Mon, Apr 6, 2015 at 3:13 PM, Anchit Jain <an...@gmail.com>
> wrote:
>
> > I want to index nutch results using *Solr 5.0* but as mentioned in
> > https://wiki.apache.org/nutch/NutchTutorial there is no directory
> > ${APACHE_SOLR_HOME}/example/solr/collection1/conf/
> > in solr 5.0 . So where I have to copy *schema.xml*?
> > Also there is no *start.jar* present in example directory.
> >
>
Re: Nutch 1.9 integration with Solr 5.0.0
Posted by Jeff Cocking <je...@gmail.com>.
With Solr 5.0.0 you can skip that step. Solr will auto-create your schema
document based on the data being provided.
One of the new features with Solr5 is the install/service feature. I did a
quick write up on how to install Solr5 on Centos. Might be something
useful there for you.
http://www.cocking.com/apache-solr-5-0-install-on-centos-7/
jeff
On Mon, Apr 6, 2015 at 3:13 PM, Anchit Jain <an...@gmail.com>
wrote:
> I want to index nutch results using *Solr 5.0* but as mentioned in
> https://wiki.apache.org/nutch/NutchTutorial there is no directory
> ${APACHE_SOLR_HOME}/example/solr/collection1/conf/
> in solr 5.0 . So where I have to copy *schema.xml*?
> Also there is no *start.jar* present in example directory.
>
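For the original question about Solr 5's changed layout: per-core configs live under server/solr/<core>/conf/ rather than example/solr/collection1/conf/, and Jetty's start.jar moved into server/ (with bin/solr being the supported launcher). A sketch of where Nutch's schema.xml would go, where the install path and the core name "nutch" are both assumptions:

```shell
#!/bin/sh
# In Solr 5 the per-core conf dir replaces example/solr/collection1/conf/.
# APACHE_SOLR_HOME and the core name "nutch" are assumptions; adjust them.
APACHE_SOLR_HOME="/opt/solr"
CORE="nutch"
CONF_DIR="$APACHE_SOLR_HOME/server/solr/$CORE/conf"
echo "$CONF_DIR"
# Copy Nutch's schema into the core, then reload the core or restart Solr:
#   cp "$NUTCH_HOME/conf/schema.xml" "$CONF_DIR/"
# start.jar also moved: it now sits under $APACHE_SOLR_HOME/server/, though
# "bin/solr start" is the recommended way to launch Solr 5.
```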