Posted to user@nutch.apache.org by Kevin Porter <ke...@tinternet.mobi> on 2014/12/29 13:10:00 UTC

crawls but fails at solr indexing

Hi,

I'm new to nutch/solr (although I understand general search engine
concepts, having built a topical search engine previously using WIRE and
swish-e).

I've installed nutch and solr, following the tutorials as best I can (it's
not easy!). I'm having a few problems.

I have nutch crawling with just two sites in the seed.txt: nutch.apache.org
and 9ballpool.co.uk. (for some reason it won't fetch anything but the
robots.txt from 9ballpool.co.uk, but that's not my main problem just now).

I've started solr and started the crawl from the runtime/local dir with:
>./bin/crawl urls/ collection1 http://localhost:8983/solr/ 5

I started solr in the 'example' dir that came with the solr installation.

It appears to be crawling nutch.apache.org, but then it fails on solr
indexing. Here's the last bit of the crawl output:

Parsing http://nutch.apache.org/apidocs/apidocs-2.2/allclasses-frame.html
Parsing http://nutch.apache.org/apidocs/apidocs-2.2/overview-frame.html
Parsing http://nutch.apache.org/apidocs/apidocs-2.2/overview-summary.html
Parsing http://9ballpool.co.uk/
ParserJob: success
CrawlDB update for collection1
DbUpdaterJob: starting
DbUpdaterJob: done
Indexing collection1 on SOLR index -> http://localhost:8983/solr/
SolrIndexerJob: starting
Adding 60 documents
Adding 60 documents
SolrIndexerJob: java.lang.RuntimeException: job failed: name=[collection1]solr-index, jobid=job_local280747177_0001
        at org.apache.nutch.util.NutchJob.waitForCompletion(NutchJob.java:54)
        at org.apache.nutch.indexer.solr.SolrIndexerJob.run(SolrIndexerJob.java:46)
        at org.apache.nutch.indexer.solr.SolrIndexerJob.indexSolr(SolrIndexerJob.java:54)
        at org.apache.nutch.indexer.solr.SolrIndexerJob.run(SolrIndexerJob.java:76)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
        at org.apache.nutch.indexer.solr.SolrIndexerJob.main(SolrIndexerJob.java:85)


What am I doing wrong?

thanks,

- Kev


-- 
http://themapps.com

Re: crawls but fails at solr indexing

Posted by Kevin Porter <ke...@tinternet.mobi>.

Answered my own question; it turned out to be very simple.
I had to set the fetcher.max.crawl.delay property above its default (and
not very realistic, IMO) value of 30s, because my robots.txt asks for a
crawl-delay of 60.
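
For anyone hitting the same thing, here is a minimal sketch of what such an override might look like. It assumes the setting goes into conf/nutch-site.xml under runtime/local, inside the <configuration> element; the helper name print_max_crawl_delay_property and the value 120 are made up for illustration and are not from the thread.

```shell
# Hypothetical helper: print an override for fetcher.max.crawl.delay, to be
# pasted inside the <configuration> element of conf/nutch-site.xml.
# $1 = largest robots.txt Crawl-delay (in seconds) the fetcher should accept;
#      -1 means "never skip, always wait as long as robots.txt asks".
print_max_crawl_delay_property() {
  cat <<EOF
<property>
  <name>fetcher.max.crawl.delay</name>
  <value>$1</value>
</property>
EOF
}

# 120 is an arbitrary example value above the 60 seconds from the robots.txt.
print_max_crawl_delay_property 120
```

Any value of 60 or more should cover the robots.txt discussed in this thread; a larger value only means the fetcher is willing to wait longer between requests to the same host.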

On 30 December 2014 at 16:54, Kevin Porter <ke...@tinternet.mobi> wrote:

>
> OK Thanks for help so far. I'm making headway, slowly.
>
> At the moment nutch won't spider 9ballpool.co.uk (the other two sites I'm
> using in my test crawl are OK). From the output of readdb I can see the
> 'protocolStatus' for that URL is 'ROBOTS_DENIED'. If I remove the
> robots.txt it crawls the site OK.
>
> I started the crawl with:
>
> ./bin/crawl urls/ testCrawl1 http://localhost:8983/solr/collection1 2
>
> If I remove the "crawl-delay: 60" from my robots.txt it works.
>
> Can anyone say why it won't spider 9ballpool.co.uk if I have the
> "crawl-delay: 60" in it?



-- 
http://themapps.com

Re: crawls but fails at solr indexing

Posted by Kevin Porter <ke...@tinternet.mobi>.

OK Thanks for help so far. I'm making headway, slowly.

At the moment nutch won't spider 9ballpool.co.uk (the other two sites I'm
using in my test crawl are OK). From the output of readdb I can see the
'protocolStatus' for that URL is 'ROBOTS_DENIED'. If I remove the
robots.txt it crawls the site OK.

I started the crawl with:

./bin/crawl urls/ testCrawl1 http://localhost:8983/solr/collection1 2

If I remove the "crawl-delay: 60" from my robots.txt it works.

Can anyone say why it won't spider 9ballpool.co.uk if I have the
"crawl-delay: 60" in it?
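
A rough model of the check that appears to be involved, assuming the documented behaviour of fetcher.max.crawl.delay: a page is skipped when the Crawl-delay in robots.txt is larger than that property (30 seconds by default), and a value of -1 switches the limit off. The function would_skip_for_crawl_delay is a hypothetical illustration, not Nutch code; it only handles whole seconds, reads the first Crawl-delay line and ignores User-agent groups.

```shell
# Hypothetical helper, not part of Nutch: would a fetcher with the given
# limit skip a host because of its robots.txt Crawl-delay?
# $1 = path to a local copy of robots.txt
# $2 = fetcher.max.crawl.delay in seconds (-1 = no limit)
would_skip_for_crawl_delay() {
  # Take the first Crawl-delay line and strip spaces and carriage returns.
  delay=$(grep -i '^crawl-delay:' "$1" | head -n 1 | cut -d: -f2 | tr -d ' \r')
  if [ -z "$delay" ] || [ "$2" -lt 0 ]; then
    # No Crawl-delay at all, or the limit is disabled.
    echo "fetch"
  elif [ "$delay" -gt "$2" ]; then
    # The requested delay is longer than the fetcher is willing to wait.
    echo "skip"
  else
    echo "fetch"
  fi
}
```

With a robots.txt containing "Crawl-delay: 60", calling the function with a limit of 30 prints "skip", while a limit of 60 or -1 prints "fetch". That matches the symptom described here, where the site is only crawled once the crawl-delay line is removed.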










On 29 December 2014 at 13:52, Kevin Porter <ke...@tinternet.mobi> wrote:

> Thanks, now it *seems* to index, ie:
> [webdev@themapps local]$ ./bin/nutch solrindex http://localhost:8983/solr/ -all
> SolrIndexerJob: starting
> SolrIndexerJob: done.
>
>
> But when I issue a query ("nutch") to solr it doesn't find any matches.
>
> How can I find out exactly which URLs are currently in the crawl db? And
> what is currently in the solr index? I'm pretty sure it did crawl/fetch at
> least about 80 pages.



-- 
http://themapps.com

Re: crawls but fails at solr indexing

Posted by Kevin Porter <ke...@tinternet.mobi>.

Thanks, now it *seems* to index, ie:
[webdev@themapps local]$ ./bin/nutch solrindex http://localhost:8983/solr/ -all
SolrIndexerJob: starting
SolrIndexerJob: done.


But when I issue a query ("nutch") to solr it doesn't find any matches.

How can I find out exactly which URLs are currently in the crawl db? And
what is currently in the solr index? I'm pretty sure it did crawl/fetch at
least about 80 pages.
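
One possible way to check both, sketched under assumptions rather than taken from the thread: Solr 4 reports the number of indexed documents as numFound when the core is queried with q=*:*, and Nutch 2.x has a readdb command with -stats and -dump options. The helper solr_count_url is hypothetical, and the crawl ID collection1 is simply the second argument of the bin/crawl command in the first message; whether -crawlId is needed depends on how the crawl was started.

```shell
# Hypothetical helper: build a Solr select URL that returns only the
# document count (numFound) and no documents.
# $1 = core URL, for example http://localhost:8983/solr/collection1
solr_count_url() {
  # ${1%/} drops a trailing slash so the path is not doubled.
  printf '%s/select?q=*:*&rows=0&wt=json\n' "${1%/}"
}

# Usage (needs a running Solr and a Nutch 2.x runtime/local directory):
#   curl "$(solr_count_url http://localhost:8983/solr/collection1)"
#   ./bin/nutch readdb -stats -crawlId collection1
#   ./bin/nutch readdb -dump /tmp/webtable-dump -crawlId collection1
```

If numFound is 0 while readdb shows fetched pages, the indexing step probably ran against a different table than the crawl. In that case it may be worth passing the same crawl ID to solrindex as well.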




On 29 December 2014 at 13:19, Chaushu, Shani <sh...@intel.com>
wrote:

> Now you have a missing-field error: add the _version_ field to the schema,
> with the properties listed in the error.
>
>
> -----Original Message-----
> From: threegarages@gmail.com [mailto:threegarages@gmail.com] On Behalf Of
> Kevin Porter
> Sent: Monday, December 29, 2014 15:15
> To: user
> Subject: Re: crawls but fails at solr indexing
>
> OK thanks I've done that now. Do I need to change anything in there?
>
> (by the way I'm using Nutch 2.2 and solr 4.10.2)
>
> Trying to do an index without having to crawl again, I tried this:
> >./bin/nutch solrindex http://localhost:8983/solr/ -all
>
> The errors I got then were:
> [webdev@themapps local]$ ./bin/nutch solrindex http://localhost:8983/solr/ -all
> SolrIndexerJob: starting
> SolrIndexerJob: org.apache.solr.common.SolrException: {msg=SolrCore
> 'collection1' is not available due to init failure: Unable to use
> updateLog: _version_ field must exist in schema, using indexed="true" or
> docValues="true", stored="true" and multiValued="false" (_version_ does not
> exist),trace=org.apache.solr.common.SolrException: SolrCore 'collection1'
> is not available due to init failure: Unable to use updateLog: _version_
> field must exist in schema, using indexed="true" or docValues="true",
> stored="true" and multiValued="false" (_version_ does not exist)
>         at org.apache.solr.core.CoreContainer.getCore(CoreContainer.java:745)
>         at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:307)
>         at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:207)
>         at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1419)
>         at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:455)
>         at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:137)
>         at org.eclipse.jet
>
> {msg=SolrCore 'collection1' is not available due to init failure: Unable
> to use updateLog: _version_ field must exist in schema, using indexed="true"
> or docValues="true", stored="true" and multiValued="false" (_version_ does
> not exist),trace=org.apache.solr.common.SolrException: SolrCore
> 'collection1' is not available due to init failure: Unable to use
> updateLog: _version_ field must exist in schema, using indexed="true" or
> docValues="true", stored="true" and multiValued="false" (_version_ does not
> exist)
>         at org.apache.solr.core.CoreContainer.getCore(CoreContainer.java:745)
>         at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:307)
>         at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:207)
>         at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1419)
>         at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:455)
>         at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:137)
>         at org.eclipse.jet
>
> request: http://localhost:8983/solr/update
>         at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:430)
>         at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:244)
>         at org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:105)
>         at org.apache.solr.client.solrj.SolrServer.commit(SolrServer.java:86)
>         at org.apache.solr.client.solrj.SolrServer.commit(SolrServer.java:75)
>         at org.apache.nutch.indexer.solr.SolrIndexerJob.indexSolr(SolrIndexerJob.java:61)
>         at org.apache.nutch.indexer.solr.SolrIndexerJob.run(SolrIndexerJob.java:76)
>         at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>         at org.apache.nutch.indexer.solr.SolrIndexerJob.main(SolrIndexerJob.java:85)
>
>
>
>
>
>
> On 29 December 2014 at 12:41, Chaushu, Shani <sh...@intel.com>
> wrote:
>
> > Nutch injects the URLs into Solr, so the Solr schema has to match the
> > Nutch schema. Inside the nutch/conf folder there is a schema-solr4.xml
> > file. You need to override the Solr schema with this file: copy it and
> > rename it so that it becomes the new schema.xml.
> >
> > -----Original Message-----
> > From: threegarages@gmail.com [mailto:threegarages@gmail.com] On Behalf
> > Of Kevin Porter
> > Sent: Monday, December 29, 2014 14:37
> > To: user
> > Subject: Re: crawls but fails at solr indexing
> >
> > I tried changing a few things (not easy to make sense of the various
> > contradictory or out-of-date tutorials), nothing worked. At present I
> > think the only change is I changed solr's schema.xml "uniqueKey" tag to
> > 'url' instead of 'id'.
> >
> > Do you mean the schema.xml in nutch or solr?
> >
> > Can you tell me definitively which schema.xml to change and what
> > changes to make?
> >
> >
> >
> > On 29 December 2014 at 12:17, Chaushu, Shani <sh...@intel.com>
> > wrote:
> >
> > > Did you change the schema.xml?
> >
> >
> >
> > --
> > http://themapps.com
> > ---------------------------------------------------------------------
> > Intel Electronics Ltd.
> >
> > This e-mail and any attachments may contain confidential material for
> > the sole use of the intended recipient(s). Any review or distribution
> > by others is strictly prohibited. If you are not the intended
> > recipient, please contact the sender and delete all copies.
> >
>
>
>
> --
> http://themapps.com
> ---------------------------------------------------------------------
> Intel Electronics Ltd.
>
> This e-mail and any attachments may contain confidential material for
> the sole use of the intended recipient(s). Any review or distribution
> by others is strictly prohibited. If you are not the intended
> recipient, please contact the sender and delete all copies.
>



-- 
http://themapps.com

RE: crawls but fails at solr indexing

Posted by "Chaushu, Shani" <sh...@intel.com>.

Now you have an error about a missing field. Add the field _version_ to the schema, with the properties listed in the error message.
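A minimal sketch of that check, assuming the Solr 4.x example layout. The SCHEMA path and the has_version_field helper are illustrative and not from this thread. The field definition it prints uses the attributes named in the error message; type="long" is an assumption that only works if the schema defines a fieldType called long.

```shell
# Hypothetical helper: succeeds if the given schema file declares a _version_ field.
has_version_field() {
  grep -q 'name="_version_"' "$1"
}

# Assumed location of the active schema; adjust to your own Solr install.
SCHEMA="${SCHEMA:-example/solr/collection1/conf/schema.xml}"

if [ -f "$SCHEMA" ] && has_version_field "$SCHEMA"; then
  echo "_version_ is declared in $SCHEMA"
else
  echo "_version_ not found in $SCHEMA; add this inside <fields>:"
  # Attributes follow the error text: indexed, stored, not multiValued.
  echo '  <field name="_version_" type="long" indexed="true" stored="true" multiValued="false"/>'
fi
```

Solr reads schema.xml when the core loads, so the change only takes effect after Solr is restarted or the core is reloaded.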


---------------------------------------------------------------------
Intel Electronics Ltd.

This e-mail and any attachments may contain confidential material for
the sole use of the intended recipient(s). Any review or distribution
by others is strictly prohibited. If you are not the intended
recipient, please contact the sender and delete all copies.

Re: crawls but fails at solr indexing

Posted by Kevin Porter <ke...@tinternet.mobi>.

OK, thanks, I've done that now. Do I need to change anything in there?

(By the way, I'm using Nutch 2.2 and Solr 4.10.2.)

To index without having to crawl again, I tried this:
>./bin/nutch solrindex http://localhost:8983/solr/ -all

The errors I got then were:
[webdev@themapps local]$ ./bin/nutch solrindex http://localhost:8983/solr/ -all
SolrIndexerJob: starting
SolrIndexerJob: org.apache.solr.common.SolrException: {msg=SolrCore
'collection1' is not available due to init failure: Unable to use
updateLog: _version_ field must exist in schema, using indexed="true" or
docValues="true", stored="true" and multiValued="false" (_version_ does not
exist),trace=org.apache.solr.common.SolrException: SolrCore 'collection1'
is not available due to init failure: Unable to use updateLog: _version_
field must exist in schema, using indexed="true" or docValues="true",
stored="true" and multiValued="false" (_version_ does not exist)
        at org.apache.solr.core.CoreContainer.getCore(CoreContainer.java:745)
        at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:307)
        at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:207)
        at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1419)
        at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:455)
        at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:137)
        at org.eclipse.jet

{msg=SolrCore 'collection1' is not available due to init failure: Unable to
use updateLog: _version_ field must exist in schema, using indexed="true"
or docValues="true", stored="true" and multiValued="false" (_version_ does
not exist),trace=org.apache.solr.common.SolrException: SolrCore
'collection1' is not available due to init failure: Unable to use
updateLog: _version_ field must exist in schema, using indexed="true" or
docValues="true", stored="true" and multiValued="false" (_version_ does not
exist)
        at org.apache.solr.core.CoreContainer.getCore(CoreContainer.java:745)
        at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:307)
        at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:207)
        at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1419)
        at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:455)
        at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:137)
        at org.eclipse.jet

request: http://localhost:8983/solr/update
        at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:430)
        at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:244)
        at org.apache.solr.client.solrj.request.AbstractUpdateRequest.process(AbstractUpdateRequest.java:105)
        at org.apache.solr.client.solrj.SolrServer.commit(SolrServer.java:86)
        at org.apache.solr.client.solrj.SolrServer.commit(SolrServer.java:75)
        at org.apache.nutch.indexer.solr.SolrIndexerJob.indexSolr(SolrIndexerJob.java:61)
        at org.apache.nutch.indexer.solr.SolrIndexerJob.run(SolrIndexerJob.java:76)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
        at org.apache.nutch.indexer.solr.SolrIndexerJob.main(SolrIndexerJob.java:85)









-- 
http://themapps.com

RE: crawls but fails at solr indexing

Posted by "Chaushu, Shani" <sh...@intel.com>.

Nutch pushes the crawled URLs into Solr, so the Solr schema must match the schema Nutch expects. The nutch/conf folder contains a schema-solr4.xml file. Replace the Solr schema with this file: copy it over Solr's schema and rename it to schema.xml.
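The copy step can be sketched roughly as follows. The install_nutch_schema helper and the directory arguments are hypothetical, and the paths in the usage comment assume the Solr 4.x example layout with a collection1 core; adjust them to your own install.

```shell
# Hypothetical helper: back up Solr's schema.xml, then replace it with
# Nutch's schema-solr4.xml.
#   $1 = Nutch conf directory, $2 = Solr core conf directory
install_nutch_schema() {
  nutch_conf="$1"
  solr_conf="$2"
  if [ ! -f "$nutch_conf/schema-solr4.xml" ]; then
    echo "missing $nutch_conf/schema-solr4.xml" >&2
    return 1
  fi
  # Keep a copy of the existing schema so the change can be undone.
  if [ -f "$solr_conf/schema.xml" ]; then
    cp "$solr_conf/schema.xml" "$solr_conf/schema.xml.bak"
  fi
  cp "$nutch_conf/schema-solr4.xml" "$solr_conf/schema.xml"
}

# Example call (assumed paths, not taken from this thread):
#   install_nutch_schema /path/to/nutch/conf /path/to/solr/example/solr/collection1/conf
```

Restart Solr afterwards so the core picks up the new schema.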


Re: crawls but fails at solr indexing

Posted by Kevin Porter <ke...@tinternet.mobi>.

I tried changing a few things (it's not easy to make sense of the various
contradictory or out-of-date tutorials), but nothing worked. At present I think
the only change is that I set the "uniqueKey" tag in Solr's schema.xml to 'url'
instead of 'id'.

Do you mean the schema.xml in Nutch or in Solr?

Can you tell me definitively which schema.xml to change and what changes to
make?






-- 
http://themapps.com

RE: crawls but fails at solr indexing

Posted by "Chaushu, Shani" <sh...@intel.com>.

Did you change the schema.xml?
