Posted to dev@nutch.apache.org by Xavier Morera <xa...@familiamorera.com> on 2014/04/10 19:05:01 UTC

Pushing content to Solr from Nutch

Hi,

I have followed several Nutch tutorials - including the main one
http://wiki.apache.org/nutch/NutchTutorial - to crawl sites, and the crawl
works: I can see in the console that the pages get crawled and the data
directories get built. But for the life of me I can't get anything posted
to Solr. The Solr console doesn't even blink, so Nutch is apparently not
sending anything.

This is the command I run; it crawls and in theory should also post to
Solr:
bin/crawl urls/seed.txt TestCrawl http://localhost:8983/solr 2

I also found that I can use this one once the crawl is already done:
bin/nutch solrindex http://localhost:8983/solr crawl/crawldb crawl/linkdb
crawl/segments/*

But no luck.

This is the only thing that caught my attention. I read that adding the
property below would fix it, but it doesn't:
*No IndexWriters activated - check your configuration*

This is the property
<property>
<name>plugin.includes</name>
<value>protocol-http|urlfilter-regex|parse-html|index-(basic|anchor)|query-(basic|site|url)|response-(json|xml)|summary-basic|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>

Any ideas? I'm running Apache Nutch 1.8 on Java 1.6 via Cygwin on Windows.

-- 
*Xavier Morera*
email: xavier@familiamorera.com
CR: +(506) 8849 8866
US: +1 (305) 600 4919
skype: xmorera




Re: Pushing content to Solr from Nutch

Posted by Sebastian Nagel <wa...@googlemail.com>.
Hi Xavier,

the log clearly indicates the problem (it's somewhat hidden behind the noise of Hadoop warnings):

Indexing 20140409233248 on SOLR index -> http://localhost:8983/solr/#/psnutch/
...
No IndexWriters activated - check your configuration

indexer-solr must be among plugin.includes. Looks like the
value of plugin.includes stems from an outdated example:
many of the plugins are not available any more.
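Concretely, a nutch-site.xml override along the following lines should
activate the Solr writer - a sketch that simply mirrors the Nutch 1.8
default from nutch-default.xml, so double-check it against your version:

```xml
<!-- Override in conf/nutch-site.xml; mirrors the Nutch 1.8 default,
     with indexer-solr included so the SOLRIndexWriter is activated. -->
<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>
```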

Sebastian



Re: Pushing content to Solr from Nutch

Posted by Xavier Morera <xa...@familiamorera.com>.
Wait, ignore my last email - the issue is on the Solr side!





-- 
*Xavier Morera*
email: xavier@familiamorera.com
CR: +(506) 8849 8866
US: +1 (305) 600 4919
skype: xmorera

Re: Pushing content to Solr from Nutch

Posted by Xavier Morera <xa...@familiamorera.com>.
Thanks Julien and Sebastian. I tried that and got the exception below. Is
there a way to get more detail on the exception so that I can continue
troubleshooting? I am getting really close! I have also attached the full
output.

This is the exception, but there is no additional info:
Indexer: java.io.IOException: Job failed!
        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1357)
        at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:114)
        at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:176)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
        at org.apache.nutch.indexer.IndexingJob.main(IndexingJob.java:186)


I also found this, which means that something is actually happening:
Indexing 20140410124128 on SOLR index -> http://localhost:8983/solr
cygpath: can't convert empty path
Indexer: starting at 2014-04-10 12:41:42
Indexer: deleting gone documents: false
Indexer: URL filtering: false
Indexer: URL normalizing: false
Active IndexWriters :
SOLRIndexWriter
        solr.server.url : URL of the SOLR instance (mandatory)
        solr.commit.size : buffer size when sending to SOLR (default 1000)
        solr.mapping.file : name of the mapping file for fields (default
solrindex-mapping.xml)
        solr.auth : use authentication (default false)
        solr.auth.username : use authentication (default false)
        solr.auth : username for authentication
        solr.auth.password : password for authentication
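As an aside, those parameters can also be pinned in nutch-site.xml rather
than being passed on the command line - a sketch using only the property
names printed in the listing above (the values here are illustrative):

```xml
<!-- Illustrative nutch-site.xml overrides for the SOLRIndexWriter
     parameters listed above; adjust the values for your setup. -->
<property>
  <name>solr.server.url</name>
  <value>http://localhost:8983/solr</value>
</property>
<property>
  <name>solr.commit.size</name>
  <value>1000</value>
</property>
```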


My full nutch-site.xml is:

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
  <property>
    <name>http.agent.name</name>
    <value>nutch-solr-integration</value>
  </property>
  <property>
    <name>generate.max.per.host</name>
    <value>100</value>
  </property>
  <property>
    <name>plugin.includes</name>
    <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
  </property>
  <property>
    <name>fs.file.impl</name>
    <value>com.conga.services.hadoop.patch.HADOOP_7682.WinLocalFileSystem</value>
    <description>Enables patch for issue HADOOP-7682 on Windows</description>
  </property>
</configuration>

And in urls/site.txt I have:
http://www.trenurbano.co.cr

And in regex-urlfilter.txt I have:
+^http://([a-z0-9]*\.)*trenurbano.co.cr/

Thanks in advance,
Xavier






-- 
*Xavier Morera*
email: xavier@familiamorera.com
CR: +(506) 8849 8866
US: +1 (305) 600 4919
skype: xmorera

Re: Pushing content to Solr from Nutch

Posted by Julien Nioche <li...@gmail.com>.
Hi Xavier

Your config file looks a bit outdated. Here are the values set by default
(see http://svn.apache.org/repos/asf/nutch/trunk/conf/nutch-default.xml)

<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>

Your problem comes from the fact that you are missing indexer-solr.

You should not need query-(basic|site|url)|response-(json|xml)|summary-basic
as they date back to times immemorial, when we used to manage the indexing
and search ourselves.

HTH

Julien





-- 

Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble