Posted to user@nutch.apache.org by "Drulea, Sherban" <sd...@rand.org> on 2015/10/01 02:21:31 UTC

Re: Unable to use nutch 2.3 crawl script for MySQL, Mongo, or Cassandra

Hi Lewis,


On 9/30/15, 11:05 AM, "Lewis John Mcgibbney" <le...@gmail.com>
wrote:

>Hi Sherban,
>
>On Wed, Sep 30, 2015 at 6:46 AM, <us...@nutch.apache.org>
>wrote:
>
>>
>> I tried with SOLR 4.9.1.
>>
>
>OK. As I said Solr 4.6 is supported but never mind.

OK. I'm using SOLR 4.6.0.

I replaced solr-4.6.0/example/solr/collection1/conf/schema.xml with the
file from https://github.com/apache/nutch/blob/2.x/conf/schema.xml.
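
Concretely, the swap was along these lines (paths are from my setup; the conf/ location can differ between Solr core layouts):

cp /Users/sdrulea/svn/release-2.3.1/runtime/local/conf/schema.xml \
   /Users/sdrulea/Downloads/solr-4.6.0/example/solr/collection1/conf/schema.xml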

When I start SOLR 4.6.0 with "java -jar start.jar", I get this error:
1094 [coreLoadExecutor-3-thread-1] INFO  org.apache.solr.update.SolrIndexConfig - IndexWriter infoStream solr logging is enabled
1097 [coreLoadExecutor-3-thread-1] INFO  org.apache.solr.core.SolrConfig - Using Lucene MatchVersion: LUCENE_46
1160 [coreLoadExecutor-3-thread-1] INFO  org.apache.solr.core.Config - Loaded SolrConfig: solrconfig.xml
1164 [coreLoadExecutor-3-thread-1] INFO  org.apache.solr.schema.IndexSchema - Reading Solr Schema from schema.xml
1176 [coreLoadExecutor-3-thread-1] INFO  org.apache.solr.schema.IndexSchema - [collection1] Schema name=nutch
1241 [coreLoadExecutor-3-thread-1] INFO  org.apache.solr.schema.IndexSchema - default search field in schema is text
1242 [coreLoadExecutor-3-thread-1] INFO  org.apache.solr.schema.IndexSchema - query parser default operator is OR
1242 [coreLoadExecutor-3-thread-1] INFO  org.apache.solr.schema.IndexSchema - unique key field: id
1243 [coreLoadExecutor-3-thread-1] ERROR org.apache.solr.core.CoreContainer - Unable to create core: collection1
org.apache.solr.common.SolrException: copyField source :'rawcontent' is not a glob and doesn't match any explicit field or dynamicField.. Schema file is /Users/sdrulea/Downloads/solr-4.6.0/example/solr/collection1/schema.xml
	at org.apache.solr.schema.IndexSchema.readSchema(IndexSchema.java:608)
	at org.apache.solr.schema.IndexSchema.<init>(IndexSchema.java:166)
	at org.apache.solr.schema.IndexSchemaFactory.create(IndexSchemaFactory.java:55)
	at org.apache.solr.schema.IndexSchemaFactory.buildIndexSchema(IndexSchemaFactory.java:69)
	at org.apache.solr.core.CoreContainer.createFromLocal(CoreContainer.java:554)
	at org.apache.solr.core.CoreContainer.create(CoreContainer.java:592)
	at org.apache.solr.core.CoreContainer$1.call(CoreContainer.java:271)
	at org.apache.solr.core.CoreContainer$1.call(CoreContainer.java:263)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.solr.common.SolrException: copyField source :'rawcontent' is not a glob and doesn't match any explicit field or dynamicField.
	at org.apache.solr.schema.IndexSchema.registerCopyField(IndexSchema.java:855)
	at org.apache.solr.schema.IndexSchema.readSchema(IndexSchema.java:592)
	... 13 more
1245 [coreLoadExecutor-3-thread-1] ERROR org.apache.solr.core.CoreContainer - null:org.apache.solr.common.SolrException: Unable to create core: collection1
	at org.apache.solr.core.CoreContainer.recordAndThrow(CoreContainer.java:977)
	at org.apache.solr.core.CoreContainer.create(CoreContainer.java:601)
	at org.apache.solr.core.CoreContainer$1.call(CoreContainer.java:271)
	at org.apache.solr.core.CoreContainer$1.call(CoreContainer.java:263)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.solr.common.SolrException: copyField source :'rawcontent' is not a glob and doesn't match any explicit field or dynamicField.. Schema file is /Users/sdrulea/Downloads/solr-4.6.0/example/solr/collection1/schema.xml
	at org.apache.solr.schema.IndexSchema.readSchema(IndexSchema.java:608)
	at org.apache.solr.schema.IndexSchema.<init>(IndexSchema.java:166)
	at org.apache.solr.schema.IndexSchemaFactory.create(IndexSchemaFactory.java:55)
	at org.apache.solr.schema.IndexSchemaFactory.buildIndexSchema(IndexSchemaFactory.java:69)
	at org.apache.solr.core.CoreContainer.createFromLocal(CoreContainer.java:554)
	at org.apache.solr.core.CoreContainer.create(CoreContainer.java:592)
	... 8 more
Caused by: org.apache.solr.common.SolrException: copyField source :'rawcontent' is not a glob and doesn't match any explicit field or dynamicField.
	at org.apache.solr.schema.IndexSchema.registerCopyField(IndexSchema.java:855)
	at org.apache.solr.schema.IndexSchema.readSchema(IndexSchema.java:592)
	... 13 more

1247 [main] INFO  org.apache.solr.servlet.SolrDispatchFilter - user.dir=/Users/sdrulea/Downloads/solr-4.6.0/example
1247 [main] INFO  org.apache.solr.servlet.SolrDispatchFilter - SolrDispatchFilter.init() done
1263 [main] INFO  org.eclipse.jetty.server.AbstractConnector - Started SocketConnector@0.0.0.0:8983


The only changes I made to schema.xml were to comment out lines with
"protwords.txt" as the tutorial suggested. Has anyone tested the 2.3.1
schema.xml with SOLR 4.6.0?
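
For reference, the protwords edit amounts to commenting out the analyzer line(s) that reference the file, something like this (the exact filter class may differ in your copy of schema.xml):

<!--
<filter class="solr.SnowballPorterFilterFactory" language="English" protected="protwords.txt"/>
-->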

>
>
>>
>> I copied /release-2.3.1/runtime/local/conf/schema.xml to
>> solr-4.9.1/example/solr/collection1/conf/schema.xml
>>
>
>Good.
>
>
>>
>> Result of /release-2.3.1/runtime/local/bin/crawl urls method_centers
>> http://localhost:8983/solr 2
>>
>>
>> InjectorJob: total number of urls rejected by filters: 1
>>
>
>Notice that your regex urlfilter is rejecting one of your seed URLs.

One of my original URLs ended with "/". I added index.html and that fixed the rejection.

InjectorJob: total number of urls rejected by filters: 0
InjectorJob: total number of urls injected after normalization and
filtering: 11


>
>
>> InjectorJob: total number of urls injected after normalization and
>> filtering: 5
>>
>
>[...snip]
>
>GeneratorJob: generated batch id: 1443556518-1067112789 containing 0 URLs
>> Generate returned 1 (no new segments created)
>> Escaping loop: no more URLs to fetch now
>>
>> There are 6 URLs in my urls/seeds.txt file. Why does it say 0 URLs?
>>
>
>1 was rejected as explained above. Additionally, it seems like there is
>also an error fetching your seeds and parsing out hyperlinks. I would
>encourage you to check the early stages of configuring and prepping your
>crawler. Some configuration is incorrect... possibly more problems with
>your regex urlfilters.

My regex-urlfilter.txt is unmodified:
# skip file: ftp: and mailto: urls
-^(file|ftp|mailto):

# skip image and other suffixes we can't yet parse
# for a more extensive coverage use the urlfilter-suffix plugin
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|CSS|sit|SIT|eps|EPS|wmf|WMF|zip|ZIP|ppt|PPT|mpg|MPG|xls|XLS|gz|GZ|rpm|RPM|tgz|TGZ|mov|MOV|exe|EXE|jpeg|JPEG|bmp|BMP|js|JS)$

# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]

# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
-.*(/[^/]+)/[^/]+\1/[^/]+\1/

# accept anything else
+.
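
A quick way to sanity-check these rules against the seeds is Nutch's URLFilterChecker utility, assuming it is present in this build (it reads URLs from stdin and prints a +/- verdict per line; verify the exact invocation against your bin/nutch):

echo "http://intranet.rand.org/eprm/rand-initiated-research/index.html" | \
  /Users/sdrulea/svn/release-2.3.1/runtime/local/bin/nutch \
  org.apache.nutch.net.URLFilterChecker -allCombined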


I copied plugin.includes to local/conf/nutch-site.xml. I added httpclient &
indexer-solr:
<property>
        <name>plugin.includes</name>
        <value>protocol-(http|httpclient)|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|urlnormalizer-(pass|regex|basic)|scoring-opic|indexer-solr</value>

        <description>Regular expression naming plugin directory names to
         include.  Any plugin not matching this expression is excluded.
         In any case you need at least include the nutch-extensionpoints plugin. By
         default Nutch includes crawling just HTML and plain text via HTTP,
         and basic indexing and search plugins. In order to use HTTPS please enable
         protocol-httpclient, but be aware of possible intermittent problems with the
         underlying commons-httpclient library.
         </description>
   </property>
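
Plugins named in plugin.includes can only load if matching directories exist under the runtime's plugins folder, so a quick existence check may help (path assumes the 2.3.1 local runtime layout):

ls /Users/sdrulea/svn/release-2.3.1/runtime/local/plugins | egrep 'protocol-httpclient|indexer-solr'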


Nutch still doesn't parse any links. Any ideas?

InjectorJob: total number of urls injected after normalization and
filtering: 11
/Users/sdrulea/svn/release-2.3.1/runtime/local/bin/nutch fetch -D
mapred.reduce.tasks=2 -D mapred.child.java.opts=-Xmx1000m -D
mapred.reduce.tasks.speculative.execution=false -D
mapred.map.tasks.speculative.execution=false -D
mapred.compress.map.output=true -D fetcher.timelimit.mins=180
1443657910-4394 -crawlId method_centers -threads 50
FetcherJob: starting at 2015-09-30 17:05:14
FetcherJob: batchId: 1443657910-4394
FetcherJob: threads: 50
FetcherJob: parsing: false
FetcherJob: resuming: false
FetcherJob : timelimit set for : 1443668714323
Using queue mode : byHost
Fetcher: threads: 50
QueueFeeder finished: total 0 records. Hit by time limit :0
….
Fetcher: throughput threshold sequence: 5
0/0 spinwaiting/active, 0 pages, 0 errors, 0.0 0 pages/s, 0 0 kb/s, 0 URLs
in 0 queues




-activeThreads=0
Using queue mode : byHost
Fetcher: threads: 50
QueueFeeder finished: total 0 records. Hit by time limit :0
….

-finishing thread FetcherThread49, activeThreads=0
0/0 spinwaiting/active, 0 pages, 0 errors, 0.0 0 pages/s, 0 0 kb/s, 0 URLs
in 0 queues


Parsing : 
/Users/sdrulea/svn/release-2.3.1/runtime/local/bin/nutch parse -D
mapred.reduce.tasks=2 -D mapred.child.java.opts=-Xmx1000m -D
mapred.reduce.tasks.speculative.execution=false -D
mapred.map.tasks.speculative.execution=false -D
mapred.compress.map.output=true -D
mapred.skip.attempts.to.start.skipping=2 -D
mapred.skip.map.max.skip.records=1 1443657910-4394 -crawlId method_centers
ParserJob: starting at 2015-09-30 17:05:27
ParserJob: resuming:	false
ParserJob: forced reparse:	false
ParserJob: batchId:	1443657910-4394
ParserJob: success
ParserJob: finished at 2015-09-30 17:05:29, time elapsed: 00:00:02
CrawlDB update for method_centers
/Users/sdrulea/svn/release-2.3.1/runtime/local/bin/nutch updatedb -D
mapred.reduce.tasks=2 -D mapred.child.java.opts=-Xmx1000m -D
mapred.reduce.tasks.speculative.execution=false -D
mapred.map.tasks.speculative.execution=false -D
mapred.compress.map.output=true 1443657910-4394 -crawlId method_centers
DbUpdaterJob: starting at 2015-09-30 17:05:30
DbUpdaterJob: batchId: 1443657910-4394
DbUpdaterJob: finished at 2015-09-30 17:05:32, time elapsed: 00:00:02
Indexing method_centers on SOLR index -> http://localhost:8983/solr
/Users/sdrulea/svn/release-2.3.1/runtime/local/bin/nutch index -D
mapred.reduce.tasks=2 -D mapred.child.java.opts=-Xmx1000m -D
mapred.reduce.tasks.speculative.execution=false -D
mapred.map.tasks.speculative.execution=false -D
mapred.compress.map.output=true -D
solr.server.url=http://localhost:8983/solr -all -crawlId method_centers
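
Since QueueFeeder reports "total 0 records", the generate step apparently marked nothing fetchable in this batch. One way to inspect the state of the injected rows is the readdb job (WebTableReader); the options below are from memory, so verify them with bin/nutch readdb -help:

/Users/sdrulea/svn/release-2.3.1/runtime/local/bin/nutch readdb -crawlId method_centers -stats
/Users/sdrulea/svn/release-2.3.1/runtime/local/bin/nutch readdb -crawlId method_centers -dump /tmp/webpage_dump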




>
>
>>
>>
>> The index job worked but there's no data in SOLR. Is there a known good
>> version of SOLR that works with 2.3.1 schema.xml? Are the tutorial
>> instructions still valid?
>>
>
>No it did not. It failed. Look at the hadoop.log.
>Also please look at your solr.log, it will provide you with better insight
>into what is wrong with your Solr server and what messages are failing.
>Thanks

The nutch schema.xml doesn't work on my SOLR 4.6.0:

IndexingJob: starting
No IndexWriters activated - check your configuration

IndexingJob: done.
SOLR dedup -> http://localhost:8983/solr
/Users/sdrulea/svn/release-2.3.1/runtime/local/bin/nutch solrdedup -D
mapred.reduce.tasks=2 -D mapred.child.java.opts=-Xmx1000m -D
mapred.reduce.tasks.speculative.execution=false -D
mapred.map.tasks.speculative.execution=false -D
mapred.compress.map.output=true http://localhost:8983/solr
Exception in thread "main" org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: Expected content type application/octet-stream but got text/html;charset=ISO-8859-1. <html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1"/>
<title>Error 500 {msg=SolrCore 'collection1' is not available due to init failure: copyField source :'rawcontent' is not a glob and doesn't match any explicit field or dynamicField.. Schema file is /Users/sdrulea/Downloads/solr-4.6.0/example/solr/collection1/schema.xml,trace=org.apache.solr.common.SolrException: SolrCore 'collection1' is not available due to init failure: copyField source :'rawcontent' is not a glob and doesn't match any explicit field or dynamicField.. Schema file is /Users/sdrulea/Downloads/solr-4.6.0/example/solr/collection1/schema.xml
	at org.apache.solr.core.CoreContainer.getCore(CoreContainer.java:818)
	at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:297)
	at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:197)
	at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1419)
	at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:455)
	at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:137)
	at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:557)
	at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:231)
	at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1075)
	at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:384)
	at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:193)
	at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1009)
	at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:135)
	at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:255)
	at org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:154)
	at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:116)
	at org.eclipse.jetty.server.Server.handle(Server.java:368)
	at org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:489)
	at org.eclipse.jetty.server.BlockingHttpConnection.handleRequest(BlockingHttpConnection.java:53)
	at org.eclipse.jetty.server.AbstractHttpConnection.headerComplete(AbstractHttpConnection.java:942)
	at org.eclipse.jetty.server.AbstractHttpConnection$RequestHandler.headerComplete(AbstractHttpConnection.java:1004)
	at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:640)
	at org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:235)
	at org.eclipse.jetty.server.BlockingHttpConnection.handle(BlockingHttpConnection.java:72)
	at org.eclipse.jetty.server.bio.SocketConnector$ConnectorEndPoint.run(SocketConnector.java:264)
	at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:608)
	at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:543)
	at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.solr.common.SolrException: copyField source :'rawcontent' is not a glob and doesn't match any explicit field or dynamicField.. Schema file is /Users/sdrulea/Downloads/solr-4.6.0/example/solr/collection1/schema.xml
	at org.apache.solr.schema.IndexSchema.readSchema(IndexSchema.java:608)
	at org.apache.solr.schema.IndexSchema.<init>(IndexSchema.java:166)
	at org.apache.solr.schema.IndexSchemaFactory.create(IndexSchemaFactory.java:55)
	at org.apache.solr.schema.IndexSchemaFactory.buildIndexSchema(IndexSchemaFactory.java:69)
	at org.apache.solr.core.CoreContainer.createFromLocal(CoreContainer.java:554)
	at org.apache.solr.core.CoreContainer.create(CoreContainer.java:592)
	at org.apache.solr.core.CoreContainer$1.call(CoreContainer.java:271)
	at org.apache.solr.core.CoreContainer$1.call(CoreContainer.java:263)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	... 1 more
Caused by: org.apache.solr.common.SolrException: copyField source :'rawcontent' is not a glob and doesn't match any explicit field or dynamicField.
	at org.apache.solr.schema.IndexSchema.registerCopyField(IndexSchema.java:855)
	at org.apache.solr.schema.IndexSchema.readSchema(IndexSchema.java:592)
	... 13 more
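
The "No IndexWriters activated" line above suggests the indexer-solr plugin never registered for the index job, independent of the Solr-side schema failure. hadoop.log should record which plugins were loaded; a rough check (exact log wording varies by version):

grep -i plugin /Users/sdrulea/svn/release-2.3.1/runtime/local/logs/hadoop.log | tail -20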



Cheers,
Sherban




Re: Unable to use nutch 2.3 crawl script for MySQL, Mongo, or Cassandra

Posted by "Drulea, Sherban" <sd...@rand.org>.
Uncommenting <copyField source="rawcontent" dest="text"/> in schema.xml
fixed the issue with SOLR.
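
For anyone hitting the same error: a copyField is only valid when its source names a declared field (or matches a dynamicField glob), so schema.xml needs both halves of the pair, roughly like this (attributes here are illustrative; take the real ones from the stock Nutch schema):

<field name="rawcontent" type="binary" indexed="false" stored="true"/>
<copyField source="rawcontent" dest="text"/>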

Now there are no error messages but also no parsing :(.

My seed.txt:
----------------------------------------------------------------------------------
http://intranet.rand.org/eprm/rand-initiated-research/proposals/fy2015/index.html
http://intranet.rand.org/eprm/rand-initiated-research/2015.html
http://intranet.rand.org/eprm/rand-initiated-research/faq.html
http://intranet.rand.org/eprm/rand-initiated-research/index.html
----------------------------------------------------------------------------------



My nutch-site.xml:
----------------------------------------------------------------------------------

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>

    <property>
        <name>http.agent.name</name>
        <value>nutch Mongo Solr Crawler</value>
    </property>

    <property>
        <name>storage.data.store.class</name>
        <value>org.apache.gora.mongodb.store.MongoStore</value>
        <description>Default class for storing data</description>
    </property>

    <property>
        <name>plugin.includes</name>
        <value>protocol-(http|httpclient)|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|urlnormalizer-(pass|regex|basic)|scoring-opic|indexer-solr</value>
        <description>Regular expression naming plugin directory names to
         include.  Any plugin not matching this expression is excluded.
         In any case you need at least include the nutch-extensionpoints plugin. By
         default Nutch includes crawling just HTML and plain text via HTTP,
         and basic indexing and search plugins. In order to use HTTPS please enable
         protocol-httpclient, but be aware of possible intermittent problems with the
         underlying commons-httpclient library.
         </description>
   </property>

</configuration>
----------------------------------------------------------------------------------
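
Since storage.data.store.class points at MongoStore, gora.properties in the same conf directory needs matching entries. A minimal sketch, assuming a local MongoDB (property names per the Gora MongoDB module; values are illustrative):

gora.datastore.default=org.apache.gora.mongodb.store.MongoStore
gora.mongodb.servers=localhost:27017
gora.mongodb.db=nutch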




My regex-urlfilter.txt:
----------------------------------------------------------------------------------

# skip file: ftp: and mailto: urls
-^(file|ftp|mailto):

# skip image and other suffixes we can't yet parse
# for a more extensive coverage use the urlfilter-suffix plugin
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|CSS|sit|SIT|eps|EPS|wmf|WMF|zip|ZIP|ppt|PPT|mpg|MPG|xls|XLS|gz|GZ|rpm|RPM|tgz|TGZ|mov|MOV|exe|EXE|jpeg|JPEG|bmp|BMP|js|JS)$

# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]

# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
-.*(/[^/]+)/[^/]+\1/[^/]+\1/

# accept anything else
+.
----------------------------------------------------------------------------------


I see these warnings in my hadoop.log:

2015-09-30 17:32:53,466 WARN  util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable

2015-09-30 17:32:54,571 WARN  conf.Configuration - file:/tmp/hadoop-sdrulea/mapred/staging/sdrulea1728069154/.staging/job_local1728069154_0001/job.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.retry.interval;  Ignoring.
2015-09-30 17:32:54,573 WARN  conf.Configuration - file:/tmp/hadoop-sdrulea/mapred/staging/sdrulea1728069154/.staging/job_local1728069154_0001/job.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.attempts;  Ignoring.
2015-09-30 17:32:54,652 WARN  conf.Configuration - file:/tmp/hadoop-sdrulea/mapred/local/localRunner/sdrulea/job_local1728069154_0001/job_local1728069154_0001.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.retry.interval;  Ignoring.
2015-09-30 17:32:54,654 WARN  conf.Configuration - file:/tmp/hadoop-sdrulea/mapred/local/localRunner/sdrulea/job_local1728069154_0001/job_local1728069154_0001.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.attempts;  Ignoring.


Any ideas?


