Posted to user@nutch.apache.org by "Drulea, Sherban" <sd...@rand.org> on 2015/10/01 02:21:31 UTC
Re: Unable to use nutch 2.3 crawl script for MySQL, Mongo, or Cassandra
Hi Lewis,
On 9/30/15, 11:05 AM, "Lewis John Mcgibbney" <le...@gmail.com>
wrote:
>Hi Sherban,
>
>On Wed, Sep 30, 2015 at 6:46 AM, <us...@nutch.apache.org>
>wrote:
>
>>
>> I tried with SOLR 4.9.1.
>>
>
>OK. As I said Solr 4.6 is supported but never mind.
OK. I'm using SOLR 4.6.0.
I replaced solr-4.6.0/example/solr/collection1/conf/schema.xml with file
from https://github.com/apache/nutch/blob/2.x/conf/schema.xml.
When I start SOLR 4.6.0 with "java -jar start.jar", I get this error:
1094 [coreLoadExecutor-3-thread-1] INFO org.apache.solr.update.SolrIndexConfig IndexWriter infoStream solr logging is enabled
1097 [coreLoadExecutor-3-thread-1] INFO org.apache.solr.core.SolrConfig Using Lucene MatchVersion: LUCENE_46
1160 [coreLoadExecutor-3-thread-1] INFO org.apache.solr.core.Config Loaded SolrConfig: solrconfig.xml
1164 [coreLoadExecutor-3-thread-1] INFO org.apache.solr.schema.IndexSchema Reading Solr Schema from schema.xml
1176 [coreLoadExecutor-3-thread-1] INFO org.apache.solr.schema.IndexSchema [collection1] Schema name=nutch
1241 [coreLoadExecutor-3-thread-1] INFO org.apache.solr.schema.IndexSchema default search field in schema is text
1242 [coreLoadExecutor-3-thread-1] INFO org.apache.solr.schema.IndexSchema query parser default operator is OR
1242 [coreLoadExecutor-3-thread-1] INFO org.apache.solr.schema.IndexSchema unique key field: id
1243 [coreLoadExecutor-3-thread-1] ERROR org.apache.solr.core.CoreContainer Unable to create core: collection1
org.apache.solr.common.SolrException: copyField source :'rawcontent' is not a glob and doesn't match any explicit field or dynamicField.. Schema file is /Users/sdrulea/Downloads/solr-4.6.0/example/solr/collection1/schema.xml
    at org.apache.solr.schema.IndexSchema.readSchema(IndexSchema.java:608)
    at org.apache.solr.schema.IndexSchema.<init>(IndexSchema.java:166)
    at org.apache.solr.schema.IndexSchemaFactory.create(IndexSchemaFactory.java:55)
    at org.apache.solr.schema.IndexSchemaFactory.buildIndexSchema(IndexSchemaFactory.java:69)
    at org.apache.solr.core.CoreContainer.createFromLocal(CoreContainer.java:554)
    at org.apache.solr.core.CoreContainer.create(CoreContainer.java:592)
    at org.apache.solr.core.CoreContainer$1.call(CoreContainer.java:271)
    at org.apache.solr.core.CoreContainer$1.call(CoreContainer.java:263)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.solr.common.SolrException: copyField source :'rawcontent' is not a glob and doesn't match any explicit field or dynamicField.
    at org.apache.solr.schema.IndexSchema.registerCopyField(IndexSchema.java:855)
    at org.apache.solr.schema.IndexSchema.readSchema(IndexSchema.java:592)
    ... 13 more
1245 [coreLoadExecutor-3-thread-1] ERROR org.apache.solr.core.CoreContainer null:org.apache.solr.common.SolrException: Unable to create core: collection1
    at org.apache.solr.core.CoreContainer.recordAndThrow(CoreContainer.java:977)
    at org.apache.solr.core.CoreContainer.create(CoreContainer.java:601)
    at org.apache.solr.core.CoreContainer$1.call(CoreContainer.java:271)
    at org.apache.solr.core.CoreContainer$1.call(CoreContainer.java:263)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.solr.common.SolrException: copyField source :'rawcontent' is not a glob and doesn't match any explicit field or dynamicField.. Schema file is /Users/sdrulea/Downloads/solr-4.6.0/example/solr/collection1/schema.xml
    at org.apache.solr.schema.IndexSchema.readSchema(IndexSchema.java:608)
    at org.apache.solr.schema.IndexSchema.<init>(IndexSchema.java:166)
    at org.apache.solr.schema.IndexSchemaFactory.create(IndexSchemaFactory.java:55)
    at org.apache.solr.schema.IndexSchemaFactory.buildIndexSchema(IndexSchemaFactory.java:69)
    at org.apache.solr.core.CoreContainer.createFromLocal(CoreContainer.java:554)
    at org.apache.solr.core.CoreContainer.create(CoreContainer.java:592)
    ... 8 more
Caused by: org.apache.solr.common.SolrException: copyField source :'rawcontent' is not a glob and doesn't match any explicit field or dynamicField.
    at org.apache.solr.schema.IndexSchema.registerCopyField(IndexSchema.java:855)
    at org.apache.solr.schema.IndexSchema.readSchema(IndexSchema.java:592)
    ... 13 more
1247 [main] INFO org.apache.solr.servlet.SolrDispatchFilter user.dir=/Users/sdrulea/Downloads/solr-4.6.0/example
1247 [main] INFO org.apache.solr.servlet.SolrDispatchFilter SolrDispatchFilter.init() done
1263 [main] INFO org.eclipse.jetty.server.AbstractConnector Started SocketConnector@0.0.0.0:8983
The only changes I made to schema.xml were to comment out the lines with "protwords.txt", as the tutorial suggested. Has anyone tested the 2.3.1 schema.xml with SOLR 4.6.1?
>
>
>>
>> I copied /release-2.3.1/runtime/local/conf/schema.xml to
>> solr-4.9.1/example/solr/collection1/conf/schema.xml
>>
>
>Good.
>
>
>>
>> Result of /release-2.3.1/runtime/local/bin/crawl urls method_centers
>> http://localhost:8983/solr 2
>>
>>
>> InjectorJob: total number of urls rejected by filters: 1
>>
>
>Notice that your regex urlfilter is rejecting one of your seed URLs.
One of my original URLs ended with "/". I added index.html and that fixed the rejection.
InjectorJob: total number of urls rejected by filters: 0
InjectorJob: total number of urls injected after normalization and filtering: 11
>
>
>> InjectorJob: total number of urls injected after normalization and
>> filtering: 5
>>
>
>[...snip]
>
>GeneratorJob: generated batch id: 1443556518-1067112789 containing 0 URLs
>> Generate returned 1 (no new segments created)
>> Escaping loop: no more URLs to fetch now
>>
>> There are 6 URLs in my urls/seeds.txt file. Why does it say 0 URLs?
>>
>
>1 was rejected as explained above. Additionally, it seems like there is
>also an error fetching your seeds and parsing out hyperlinks. I would
>encourage you to check the early stages of configuring and prepping your
>crawler. Some configuration is incorrect... possibly more problems with
>your regex urlfilters.
My regex-urlfilter.txt is unmodified:
# skip file: ftp: and mailto: urls
-^(file|ftp|mailto):
# skip image and other suffixes we can't yet parse
# for a more extensive coverage use the urlfilter-suffix plugin
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|CSS|sit|SIT|eps|EPS|wmf|WMF|zip|ZIP|ppt|PPT|mpg|MPG|xls|XLS|gz|GZ|rpm|RPM|tgz|TGZ|mov|MOV|exe|EXE|jpeg|JPEG|bmp|BMP|js|JS)$
# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]
# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
-.*(/[^/]+)/[^/]+\1/[^/]+\1/
# accept anything else
+.
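For what it's worth, replaying those default rules outside Nutch suggests my seed URLs should be accepted. Here is a small Python sketch of how the filter behaves; this is my own approximation of RegexURLFilter's first-match-wins semantics, not Nutch's actual code:

```python
import re

# Default Nutch regex-urlfilter rules, in file order. The first pattern
# that matches decides: '-' rejects the URL, '+' accepts it.
RULES = [
    ('-', r'^(file|ftp|mailto):'),
    ('-', r'\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|CSS|sit|SIT|eps|EPS|'
          r'wmf|WMF|zip|ZIP|ppt|PPT|mpg|MPG|xls|XLS|gz|GZ|rpm|RPM|tgz|TGZ|'
          r'mov|MOV|exe|EXE|jpeg|JPEG|bmp|BMP|js|JS)$'),
    ('-', r'[?*!@=]'),
    ('-', r'.*(/[^/]+)/[^/]+\1/[^/]+\1/'),
    ('+', r'.'),
]

def accepts(url):
    for sign, pattern in RULES:
        if re.search(pattern, url):
            return sign == '+'
    return False  # no rule matched: URL is filtered out

if __name__ == '__main__':
    print(accepts('http://intranet.rand.org/eprm/rand-initiated-research/index.html'))  # True
```

Under these rules an index.html seed passes, while ftp: URLs, image suffixes, and query strings are rejected.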
I copied plugin.includes to local/conf/nutch-site.xml. I added httpclient and indexer-solr:
<property>
  <name>plugin.includes</name>
  <value>protocol-(http|httpclient)|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|urlnormalizer-(pass|regex|basic)|scoring-opic|indexer-solr</value>
  <description>Regular expression naming plugin directory names to
  include. Any plugin not matching this expression is excluded.
  In any case you need at least include the nutch-extensionpoints plugin. By
  default Nutch includes crawling just HTML and plain text via HTTP,
  and basic indexing and search plugins. In order to use HTTPS please enable
  protocol-httpclient, but be aware of possible intermittent problems with the
  underlying commons-httpclient library.
  </description>
</property>
Nutch still doesn't parse any links. Any ideas?
InjectorJob: total number of urls injected after normalization and filtering: 11
/Users/sdrulea/svn/release-2.3.1/runtime/local/bin/nutch fetch -D
mapred.reduce.tasks=2 -D mapred.child.java.opts=-Xmx1000m -D
mapred.reduce.tasks.speculative.execution=false -D
mapred.map.tasks.speculative.execution=false -D
mapred.compress.map.output=true -D fetcher.timelimit.mins=180
1443657910-4394 -crawlId method_centers -threads 50
FetcherJob: starting at 2015-09-30 17:05:14
FetcherJob: batchId: 1443657910-4394
FetcherJob: threads: 50
FetcherJob: parsing: false
FetcherJob: resuming: false
FetcherJob : timelimit set for : 1443668714323
Using queue mode : byHost
Fetcher: threads: 50
QueueFeeder finished: total 0 records. Hit by time limit :0
….
Fetcher: throughput threshold sequence: 5
0/0 spinwaiting/active, 0 pages, 0 errors, 0.0 0 pages/s, 0 0 kb/s, 0 URLs in 0 queues
-activeThreads=0
Using queue mode : byHost
Fetcher: threads: 50
QueueFeeder finished: total 0 records. Hit by time limit :0
….
-finishing thread FetcherThread49, activeThreads=0
0/0 spinwaiting/active, 0 pages, 0 errors, 0.0 0 pages/s, 0 0 kb/s, 0 URLs in 0 queues
Parsing :
/Users/sdrulea/svn/release-2.3.1/runtime/local/bin/nutch parse -D
mapred.reduce.tasks=2 -D mapred.child.java.opts=-Xmx1000m -D
mapred.reduce.tasks.speculative.execution=false -D
mapred.map.tasks.speculative.execution=false -D
mapred.compress.map.output=true -D
mapred.skip.attempts.to.start.skipping=2 -D
mapred.skip.map.max.skip.records=1 1443657910-4394 -crawlId method_centers
ParserJob: starting at 2015-09-30 17:05:27
ParserJob: resuming: false
ParserJob: forced reparse: false
ParserJob: batchId: 1443657910-4394
ParserJob: success
ParserJob: finished at 2015-09-30 17:05:29, time elapsed: 00:00:02
CrawlDB update for method_centers
/Users/sdrulea/svn/release-2.3.1/runtime/local/bin/nutch updatedb -D
mapred.reduce.tasks=2 -D mapred.child.java.opts=-Xmx1000m -D
mapred.reduce.tasks.speculative.execution=false -D
mapred.map.tasks.speculative.execution=false -D
mapred.compress.map.output=true 1443657910-4394 -crawlId method_centers
DbUpdaterJob: starting at 2015-09-30 17:05:30
DbUpdaterJob: batchId: 1443657910-4394
DbUpdaterJob: finished at 2015-09-30 17:05:32, time elapsed: 00:00:02
Indexing method_centers on SOLR index -> http://localhost:8983/solr
/Users/sdrulea/svn/release-2.3.1/runtime/local/bin/nutch index -D
mapred.reduce.tasks=2 -D mapred.child.java.opts=-Xmx1000m -D
mapred.reduce.tasks.speculative.execution=false -D
mapred.map.tasks.speculative.execution=false -D
mapred.compress.map.output=true -D
solr.server.url=http://localhost:8983/solr -all -crawlId method_centers
>
>
>>
>>
>> The index job worked but there's no data in SOLR. Is there a known good
>> version of SOLR that works with 2.3.1 schema.xml? Are the tutorial
>> instructions still valid?
>>
>
>No, it did not. It failed. Look at the hadoop.log.
>Also please look at your solr.log, it will provide you with better insight
>into what is wrong with your Solr server and what messages are failing.
>Thanks
The nutch schema.xml doesn't work on my SOLR 4.6.0:
IndexingJob: starting
No IndexWriters activated - check your configuration
IndexingJob: done.
SOLR dedup -> http://localhost:8983/solr
/Users/sdrulea/svn/release-2.3.1/runtime/local/bin/nutch solrdedup -D
mapred.reduce.tasks=2 -D mapred.child.java.opts=-Xmx1000m -D
mapred.reduce.tasks.speculative.execution=false -D
mapred.map.tasks.speculative.execution=false -D
mapred.compress.map.output=true http://localhost:8983/solr
Exception in thread "main" org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: Expected content type application/octet-stream but got text/html;charset=ISO-8859-1. <html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1"/>
<title>Error 500 {msg=SolrCore 'collection1' is not available due to init failure: copyField source :'rawcontent' is not a glob and doesn't match any explicit field or dynamicField.. Schema file is /Users/sdrulea/Downloads/solr-4.6.0/example/solr/collection1/schema.xml,trace=org.apache.solr.common.SolrException: SolrCore 'collection1' is not available due to init failure: copyField source :'rawcontent' is not a glob and doesn't match any explicit field or dynamicField.. Schema file is /Users/sdrulea/Downloads/solr-4.6.0/example/solr/collection1/schema.xml
    at org.apache.solr.core.CoreContainer.getCore(CoreContainer.java:818)
    at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:297)
    at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:197)
    at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1419)
    at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:455)
    at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:137)
    at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:557)
    at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:231)
    at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1075)
    at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:384)
    at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:193)
    at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1009)
    at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:135)
    at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:255)
    at org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:154)
    at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:116)
    at org.eclipse.jetty.server.Server.handle(Server.java:368)
    at org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:489)
    at org.eclipse.jetty.server.BlockingHttpConnection.handleRequest(BlockingHttpConnection.java:53)
    at org.eclipse.jetty.server.AbstractHttpConnection.headerComplete(AbstractHttpConnection.java:942)
    at org.eclipse.jetty.server.AbstractHttpConnection$RequestHandler.headerComplete(AbstractHttpConnection.java:1004)
    at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:640)
    at org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:235)
    at org.eclipse.jetty.server.BlockingHttpConnection.handle(BlockingHttpConnection.java:72)
    at org.eclipse.jetty.server.bio.SocketConnector$ConnectorEndPoint.run(SocketConnector.java:264)
    at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:608)
    at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:543)
    at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.solr.common.SolrException: copyField source :'rawcontent' is not a glob and doesn't match any explicit field or dynamicField.. Schema file is /Users/sdrulea/Downloads/solr-4.6.0/example/solr/collection1/schema.xml
    at org.apache.solr.schema.IndexSchema.readSchema(IndexSchema.java:608)
    at org.apache.solr.schema.IndexSchema.<init>(IndexSchema.java:166)
    at org.apache.solr.schema.IndexSchemaFactory.create(IndexSchemaFactory.java:55)
    at org.apache.solr.schema.IndexSchemaFactory.buildIndexSchema(IndexSchemaFactory.java:69)
    at org.apache.solr.core.CoreContainer.createFromLocal(CoreContainer.java:554)
    at org.apache.solr.core.CoreContainer.create(CoreContainer.java:592)
    at org.apache.solr.core.CoreContainer$1.call(CoreContainer.java:271)
    at org.apache.solr.core.CoreContainer$1.call(CoreContainer.java:263)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    ... 1 more
Caused by: org.apache.solr.common.SolrException: copyField source :'rawcontent' is not a glob and doesn't match any explicit field or dynamicField.
    at org.apache.solr.schema.IndexSchema.registerCopyField(IndexSchema.java:855)
    at org.apache.solr.schema.IndexSchema.readSchema(IndexSchema.java:592)
    ... 13 more
Cheers,
Sherban
Re: Unable to use nutch 2.3 crawl script for MySQL, Mongo, or Cassandra
Posted by "Drulea, Sherban" <sd...@rand.org>.
Uncommenting <copyField source="rawcontent" dest="text"/> in schema.xml
fixed the issue with SOLR.
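For reference, the point of that init failure is that a copyField's source must match a declared field (or a glob/dynamicField), so both sides have to be active together. A sketch of the pair; the type and flags here are illustrative, not necessarily what the Nutch 2.x schema ships:

```xml
<!-- Both declarations must be uncommented together: the copyField below
     fails at core init if no field named "rawcontent" exists.
     type/indexed/stored values are illustrative. -->
<field name="rawcontent" type="string" stored="true" indexed="false"/>
<copyField source="rawcontent" dest="text"/>
```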
Now there are no error messages but also no parsing :(.
My seed.txt:
---------------------------------------------------------------------------
http://intranet.rand.org/eprm/rand-initiated-research/proposals/fy2015/index.html
http://intranet.rand.org/eprm/rand-initiated-research/2015.html
http://intranet.rand.org/eprm/rand-initiated-research/faq.html
http://intranet.rand.org/eprm/rand-initiated-research/index.html
---------------------------------------------------------------------------
My nutch-site.xml:
---------------------------------------------------------------------------
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
<property>
  <name>http.agent.name</name>
  <value>nutch Mongo Solr Crawler</value>
</property>
<property>
  <name>storage.data.store.class</name>
  <value>org.apache.gora.mongodb.store.MongoStore</value>
  <description>Default class for storing data</description>
</property>
<property>
  <name>plugin.includes</name>
  <value>protocol-(http|httpclient)|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|urlnormalizer-(pass|regex|basic)|scoring-opic|indexer-solr</value>
  <description>Regular expression naming plugin directory names to
  include. Any plugin not matching this expression is excluded.
  In any case you need at least include the nutch-extensionpoints plugin. By
  default Nutch includes crawling just HTML and plain text via HTTP,
  and basic indexing and search plugins. In order to use HTTPS please enable
  protocol-httpclient, but be aware of possible intermittent problems with the
  underlying commons-httpclient library.
  </description>
</property>
</configuration>
---------------------------------------------------------------------------
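Since storage.data.store.class points at MongoStore, "total 0 records" at the fetch stage can also mean the Gora backend isn't wired up, so it may be worth double-checking conf/gora.properties against the 2.x MongoDB setup. A sketch; the servers and db values below are assumptions for illustration:

```properties
# Select the MongoDB store as Gora's default backend
gora.datastore.default=org.apache.gora.mongodb.store.MongoStore
gora.mongodb.override_hadoop_configuration=false
gora.mongodb.mapping.file=/gora-mongodb-mapping.xml
gora.mongodb.servers=localhost:27017
gora.mongodb.db=nutch
```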
My regex-urlfilter.txt:
---------------------------------------------------------------------------
# skip file: ftp: and mailto: urls
-^(file|ftp|mailto):
# skip image and other suffixes we can't yet parse
# for a more extensive coverage use the urlfilter-suffix plugin
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|CSS|sit|SIT|eps|EPS|wmf|WMF|zip|ZIP|ppt|PPT|mpg|MPG|xls|XLS|gz|GZ|rpm|RPM|tgz|TGZ|mov|MOV|exe|EXE|jpeg|JPEG|bmp|BMP|js|JS)$
# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]
# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
-.*(/[^/]+)/[^/]+\1/[^/]+\1/
# accept anything else
+.
---------------------------------------------------------------------------
I see these warnings in my hadoop.log:
2015-09-30 17:32:53,466 WARN util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2015-09-30 17:32:54,571 WARN conf.Configuration - file:/tmp/hadoop-sdrulea/mapred/staging/sdrulea1728069154/.staging/job_local1728069154_0001/job.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.retry.interval; Ignoring.
2015-09-30 17:32:54,573 WARN conf.Configuration - file:/tmp/hadoop-sdrulea/mapred/staging/sdrulea1728069154/.staging/job_local1728069154_0001/job.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.attempts; Ignoring.
2015-09-30 17:32:54,652 WARN conf.Configuration - file:/tmp/hadoop-sdrulea/mapred/local/localRunner/sdrulea/job_local1728069154_0001/job_local1728069154_0001.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.retry.interval; Ignoring.
2015-09-30 17:32:54,654 WARN conf.Configuration - file:/tmp/hadoop-sdrulea/mapred/local/localRunner/sdrulea/job_local1728069154_0001/job_local1728069154_0001.xml:an attempt to override final parameter: mapreduce.job.end-notification.max.attempts; Ignoring.
Any ideas?
On 9/30/15, 5:21 PM, "Drulea, Sherban" <sd...@rand.org> wrote:
>Hi Lewis,
>
>
>On 9/30/15, 11:05 AM, "Lewis John Mcgibbney" <le...@gmail.com>
>wrote:
>
>>Hi Sherban,
>>
>>On Wed, Sep 30, 2015 at 6:46 AM, <us...@nutch.apache.org>
>>wrote:
>>
>>>
>>> I tried with SOLR 4.9.1.
>>>
>>
>>OK. As I said Solr 4.6 is supported but never mind.
>
>OK. I¹m using SOLR 4.6.0.
>
>I replaced solr-4.6.0/example/solr/collection1/conf/schema.xml with file
>from https://github.com/apache/nutch/blob/2.x/conf/schema.xml.
>
>When I start SOLR 4.6.0. With "java -jar start.jar², I get this error:
>1094 [coreLoadExecutor-3-thread-1] INFO
>org.apache.solr.update.SolrIndexConfig IndexWriter infoStream solr
>logging is enabled
>1097 [coreLoadExecutor-3-thread-1] INFO org.apache.solr.core.SolrConfig
> Using Lucene MatchVersion: LUCENE_46
>1160 [coreLoadExecutor-3-thread-1] INFO org.apache.solr.core.Config
>Loaded SolrConfig: solrconfig.xml
>1164 [coreLoadExecutor-3-thread-1] INFO
>org.apache.solr.schema.IndexSchema Reading Solr Schema from schema.xml
>1176 [coreLoadExecutor-3-thread-1] INFO
>org.apache.solr.schema.IndexSchema [collection1] Schema name=nutch
>1241 [coreLoadExecutor-3-thread-1] INFO
>org.apache.solr.schema.IndexSchema default search field in schema is
>text
>1242 [coreLoadExecutor-3-thread-1] INFO
>org.apache.solr.schema.IndexSchema query parser default operator is OR
>1242 [coreLoadExecutor-3-thread-1] INFO
>org.apache.solr.schema.IndexSchema unique key field: id
>1243 [coreLoadExecutor-3-thread-1] ERROR
>org.apache.solr.core.CoreContainer Unable to create core: collection1
>org.apache.solr.common.SolrException: copyField source :'rawcontent' is
>not a glob and doesn't match any explicit field or dynamicField.. Schema
>file is
>/Users/sdrulea/Downloads/solr-4.6.0/example/solr/collection1/schema.xml
> at org.apache.solr.schema.IndexSchema.readSchema(IndexSchema.java:608)
> at org.apache.solr.schema.IndexSchema.<init>(IndexSchema.java:166)
> at
>org.apache.solr.schema.IndexSchemaFactory.create(IndexSchemaFactory.java:5
>5
>)
> at
>org.apache.solr.schema.IndexSchemaFactory.buildIndexSchema(IndexSchemaFact
>o
>ry.java:69)
> at
>org.apache.solr.core.CoreContainer.createFromLocal(CoreContainer.java:554)
> at org.apache.solr.core.CoreContainer.create(CoreContainer.java:592)
> at org.apache.solr.core.CoreContainer$1.call(CoreContainer.java:271)
> at org.apache.solr.core.CoreContainer$1.call(CoreContainer.java:263)
> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> at
>java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> at
>java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:
>1
>142)
> at
>java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java
>:
>617)
> at java.lang.Thread.run(Thread.java:745)
>Caused by: org.apache.solr.common.SolrException: copyField source
>:'rawcontent' is not a glob and doesn't match any explicit field or
>dynamicField.
> at
>org.apache.solr.schema.IndexSchema.registerCopyField(IndexSchema.java:855)
> at org.apache.solr.schema.IndexSchema.readSchema(IndexSchema.java:592)
> ... 13 more
>1245 [coreLoadExecutor-3-thread-1] ERROR
>org.apache.solr.core.CoreContainer
>null:org.apache.solr.common.SolrException: Unable to create core:
>collection1
> at
>org.apache.solr.core.CoreContainer.recordAndThrow(CoreContainer.java:977)
> at org.apache.solr.core.CoreContainer.create(CoreContainer.java:601)
> at org.apache.solr.core.CoreContainer$1.call(CoreContainer.java:271)
> at org.apache.solr.core.CoreContainer$1.call(CoreContainer.java:263)
> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> at
>java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> at
>java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:
>1
>142)
> at
>java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java
>:
>617)
> at java.lang.Thread.run(Thread.java:745)
>Caused by: org.apache.solr.common.SolrException: copyField source
>:'rawcontent' is not a glob and doesn't match any explicit field or
>dynamicField.. Schema file is
>/Users/sdrulea/Downloads/solr-4.6.0/example/solr/collection1/schema.xml
> at org.apache.solr.schema.IndexSchema.readSchema(IndexSchema.java:608)
> at org.apache.solr.schema.IndexSchema.<init>(IndexSchema.java:166)
> at
>org.apache.solr.schema.IndexSchemaFactory.create(IndexSchemaFactory.java:5
>5
>)
> at
>org.apache.solr.schema.IndexSchemaFactory.buildIndexSchema(IndexSchemaFact
>o
>ry.java:69)
> at
>org.apache.solr.core.CoreContainer.createFromLocal(CoreContainer.java:554)
> at org.apache.solr.core.CoreContainer.create(CoreContainer.java:592)
> ... 8 more
>Caused by: org.apache.solr.common.SolrException: copyField source
>:'rawcontent' is not a glob and doesn't match any explicit field or
>dynamicField.
> at
>org.apache.solr.schema.IndexSchema.registerCopyField(IndexSchema.java:855)
> at org.apache.solr.schema.IndexSchema.readSchema(IndexSchema.java:592)
> ... 13 more
>
>1247 [main] INFO org.apache.solr.servlet.SolrDispatchFilter
>user.dir=/Users/sdrulea/Downloads/solr-4.6.0/example
>1247 [main] INFO org.apache.solr.servlet.SolrDispatchFilter
>SolrDispatchFilter.init() done
>1263 [main] INFO org.eclipse.jetty.server.AbstractConnector Started
>SocketConnector@0.0.0.0:8983
>
>
>The only changes I made to schema.xml were to comment out lines with
>³protwords.txt² as the tutorial suggested. Has anyone tested the 2.3.1
>schema.xml with SOLR 4.6.1?
>
>>
>>
>>>
>>> I copied /release-2.3.1/runtime/local/conf/schema.xml to
>>> solr-4.9.1/example/solr/collection1/conf/schema.xml
>>>
>>
>>Good.
>>
>>
>>>
>>> Result of /release-2.3.1/runtime/local/bin/crawl urls method_centers
>>> http://localhost:8983/solr 2
>>>
>>>
>>> InjectorJob: total number of urls rejected by filters: 1
>>>
>>
>>Notice that you regex urlfilter is rejecting one of your seed URLs.
>
>One of my original URLs ended with ³/". I added index.html and that fixed
>the rejection.
>
>InjectorJob: total number of urls rejected by filters: 0
>InjectorJob: total number of urls injected after normalization and
>filtering: 11
>
>
>>
>>
>>> InjectorJob: total number of urls injected after normalization and
>>> filtering: 5
>>>
>>
>>[...snip]
>>
>>GeneratorJob: generated batch id: 1443556518-1067112789 containing 0 URLs
>>> Generate returned 1 (no new segments created)
>>> Escaping loop: no more URLs to fetch now
>>>
>>> There are 6 URLs in my urls/seeds.txt file. Why does it say 0 URLs?
>>>
>>
>>1 was rejected as explained above. Additionally, it seems like there is
>>also an error fetching your seeds and parsing out hyperlinks. I would
>>encourage you to check the early stages of configuring and prepping your
>>crawler. Some configuration is incorrect... possibly more problems with
>>your regex urlfilters.
>
>My regex-urlfilter.txt is unmodified:
># skip file: ftp: and mailto: urls
>-^(file|ftp|mailto):
>
># skip image and other suffixes we can't yet parse
># for a more extensive coverage use the urlfilter-suffix plugin
>-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|CSS|sit|SIT|eps|EPS|wmf|WMF|zip|ZI
>P
>|ppt|PPT|mpg|MPG|xls|XLS|gz|GZ|rpm|RPM|tgz|TGZ|mov|MOV|exe|EXE|jpeg|JPEG|b
>m
>p|BMP|js|JS)$
>
># skip URLs containing certain characters as probable queries, etc.
>-[?*!@=]
>
># skip URLs with slash-delimited segment that repeats 3+ times, to break
>loops
>-.*(/[^/]+)/[^/]+\1/[^/]+\1/
>
># accept anything else
>+.
>
>
>I copied plugin.includes to local/conf/nutch-site.xml. I aded httpclient &
>indexer-solr
><property>
> <name>plugin.includes</name>
>
><value>protocol-(http|httpclient)|urlfilter-regex|parse-(html|tika)|index-
>(
>basic|anchor)|urlnormalizer-(pass|regex|basic)|scoring-opic|indexer-solr</
>v
>alue>
>
> <description>Regular expression naming plugin directory names to
> include. Any plugin not matching this expression is excluded.
> In any case you need at least include the nutch-extensionpoints
>plugin. By
> default Nutch includes crawling just HTML and plain text via
>HTTP,
> and basic indexing and search plugins. In order to use HTTPS
>please enable
> protocol-httpclient, but be aware of possible intermittent
>problems with the
> underlying commons-httpclient library.
> </description>
> </property>
>
>
>Nutch still doesn¹t parse any links. Any ideas?
>
>InjectorJob: total number of urls injected after normalization and
>filtering: 11
>/Users/sdrulea/svn/release-2.3.1/runtime/local/bin/nutch fetch -D
>mapred.reduce.tasks=2 -D mapred.child.java.opts=-Xmx1000m -D
>mapred.reduce.tasks.speculative.execution=false -D
>mapred.map.tasks.speculative.execution=false -D
>mapred.compress.map.output=true -D fetcher.timelimit.mins=180
>1443657910-4394 -crawlId method_centers -threads 50
>FetcherJob: starting at 2015-09-30 17:05:14
>FetcherJob: batchId: 1443657910-4394
>FetcherJob: threads: 50
>FetcherJob: parsing: false
>FetcherJob: resuming: false
>FetcherJob : timelimit set for : 1443668714323
>Using queue mode : byHost
>Fetcher: threads: 50
>QueueFeeder finished: total 0 records. Hit by time limit :0
>Š.
>Fetcher: throughput threshold sequence: 5
>0/0 spinwaiting/active, 0 pages, 0 errors, 0.0 0 pages/s, 0 0 kb/s, 0 URLs
>in 0 queues
>
>
>
>
>-activeThreads=0
>Using queue mode : byHost
>Fetcher: threads: 50
>QueueFeeder finished: total 0 records. Hit by time limit :0
>Š.
>
>-finishing thread FetcherThread49, activeThreads=0
>0/0 spinwaiting/active, 0 pages, 0 errors, 0.0 0 pages/s, 0 0 kb/s, 0 URLs
>in 0 queues
>
>
>Parsing :
>/Users/sdrulea/svn/release-2.3.1/runtime/local/bin/nutch parse -D
>mapred.reduce.tasks=2 -D mapred.child.java.opts=-Xmx1000m -D
>mapred.reduce.tasks.speculative.execution=false -D
>mapred.map.tasks.speculative.execution=false -D
>mapred.compress.map.output=true -D
>mapred.skip.attempts.to.start.skipping=2 -D
>mapred.skip.map.max.skip.records=1 1443657910-4394 -crawlId method_centers
>ParserJob: starting at 2015-09-30 17:05:27
>ParserJob: resuming: false
>ParserJob: forced reparse: false
>ParserJob: batchId: 1443657910-4394
>ParserJob: success
>ParserJob: finished at 2015-09-30 17:05:29, time elapsed: 00:00:02
>CrawlDB update for method_centers
>/Users/sdrulea/svn/release-2.3.1/runtime/local/bin/nutch updatedb -D
>mapred.reduce.tasks=2 -D mapred.child.java.opts=-Xmx1000m -D
>mapred.reduce.tasks.speculative.execution=false -D
>mapred.map.tasks.speculative.execution=false -D
>mapred.compress.map.output=true 1443657910-4394 -crawlId method_centers
>DbUpdaterJob: starting at 2015-09-30 17:05:30
>DbUpdaterJob: batchId: 1443657910-4394
>DbUpdaterJob: finished at 2015-09-30 17:05:32, time elapsed: 00:00:02
>Indexing method_centers on SOLR index -> http://localhost:8983/solr
>/Users/sdrulea/svn/release-2.3.1/runtime/local/bin/nutch index -D
>mapred.reduce.tasks=2 -D mapred.child.java.opts=-Xmx1000m -D
>mapred.reduce.tasks.speculative.execution=false -D
>mapred.map.tasks.speculative.execution=false -D
>mapred.compress.map.output=true -D
>solr.server.url=http://localhost:8983/solr -all -crawlId method_centers
>
>
>
>
>>
>>
>>>
>>>
>>> The index job worked but there's no data in SOLR. Is there a known good
>>> version of SOLR that works with 2.3.1 schema.xml? Are the tutorial
>>> instructions still valid?
>>>
>>
>>No, it did not. It failed. Look at the hadoop.log.
>>Also please look at your solr.log, it will provide you with better
>>insight
>>into what is wrong with your Solr server and what messages are failing.
>>Thanks
>
>The nutch schema.xml doesn't work on my SOLR 4.6.0:
>
>IndexingJob: starting
>No IndexWriters activated - check your configuration
>
>IndexingJob: done.
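[Editor's note: "No IndexWriters activated - check your configuration" usually means no indexing plugin is enabled, so the index job exits without writing anything to Solr. A sketch of the relevant nutch-site.xml property follows; the exact plugin list is an assumption and should be matched to your installation's nutch-default.xml, but `indexer-solr` must appear in it for Solr indexing to run.]

```xml
<!-- Sketch of a nutch-site.xml override (assumed plugin set; adjust to
     your own nutch-default.xml). The key point is that indexer-solr is
     included, otherwise the index job logs "No IndexWriters activated". -->
<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|indexer-solr|urlnormalizer-(pass|regex|basic)|scoring-opic</value>
</property>
```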
>SOLR dedup -> http://localhost:8983/solr
>/Users/sdrulea/svn/release-2.3.1/runtime/local/bin/nutch solrdedup -D
>mapred.reduce.tasks=2 -D mapred.child.java.opts=-Xmx1000m -D
>mapred.reduce.tasks.speculative.execution=false -D
>mapred.map.tasks.speculative.execution=false -D
>mapred.compress.map.output=true http://localhost:8983/solr
>Exception in thread "main"
>org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException:
>Expected content type application/octet-stream but got
>text/html;charset=ISO-8859-1. <html>
><head>
><meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1"/>
><title>Error 500 {msg=SolrCore 'collection1' is not available due to init
>failure: copyField source :'rawcontent' is not a glob and doesn't match
>any explicit field or dynamicField.. Schema file is
>/Users/sdrulea/Downloads/solr-4.6.0/example/solr/collection1/schema.xml,
>trace=org.apache.solr.common.SolrException: SolrCore 'collection1' is not
>available due to init failure: copyField source :'rawcontent' is not a
>glob and doesn't match any explicit field or dynamicField.. Schema file is
>/Users/sdrulea/Downloads/solr-4.6.0/example/solr/collection1/schema.xml
> at org.apache.solr.core.CoreContainer.getCore(CoreContainer.java:818)
> at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:297)
> at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:197)
> at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1419)
> at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:455)
> at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:137)
> at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:557)
> at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:231)
> at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1075)
> at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:384)
> at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:193)
> at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1009)
> at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:135)
> at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:255)
> at org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:154)
> at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:116)
> at org.eclipse.jetty.server.Server.handle(Server.java:368)
> at org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:489)
> at org.eclipse.jetty.server.BlockingHttpConnection.handleRequest(BlockingHttpConnection.java:53)
> at org.eclipse.jetty.server.AbstractHttpConnection.headerComplete(AbstractHttpConnection.java:942)
> at org.eclipse.jetty.server.AbstractHttpConnection$RequestHandler.headerComplete(AbstractHttpConnection.java:1004)
> at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:640)
> at org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:235)
> at org.eclipse.jetty.server.BlockingHttpConnection.handle(BlockingHttpConnection.java:72)
> at org.eclipse.jetty.server.bio.SocketConnector$ConnectorEndPoint.run(SocketConnector.java:264)
> at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:608)
> at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:543)
> at java.lang.Thread.run(Thread.java:745)
>Caused by: org.apache.solr.common.SolrException: copyField source
>:'rawcontent' is not a glob and doesn't match any explicit field or
>dynamicField.. Schema file is
>/Users/sdrulea/Downloads/solr-4.6.0/example/solr/collection1/schema.xml
> at org.apache.solr.schema.IndexSchema.readSchema(IndexSchema.java:608)
> at org.apache.solr.schema.IndexSchema.<init>(IndexSchema.java:166)
> at org.apache.solr.schema.IndexSchemaFactory.create(IndexSchemaFactory.java:55)
> at org.apache.solr.schema.IndexSchemaFactory.buildIndexSchema(IndexSchemaFactory.java:69)
> at org.apache.solr.core.CoreContainer.createFromLocal(CoreContainer.java:554)
> at org.apache.solr.core.CoreContainer.create(CoreContainer.java:592)
> at org.apache.solr.core.CoreContainer$1.call(CoreContainer.java:271)
> at org.apache.solr.core.CoreContainer$1.call(CoreContainer.java:263)
> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> ... 1 more
>Caused by: org.apache.solr.common.SolrException: copyField source
>:'rawcontent' is not a glob and doesn't match any explicit field or
>dynamicField.
> at org.apache.solr.schema.IndexSchema.registerCopyField(IndexSchema.java:855)
> at org.apache.solr.schema.IndexSchema.readSchema(IndexSchema.java:592)
> ... 13 more
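[Editor's note: the root cause in the trace is a schema problem, not a Nutch problem. The schema.xml copied from the Nutch 2.x conf directory contains a copyField directive whose source field, rawcontent, is not declared as a field, so the collection1 core refuses to initialize and every request (including solrdedup) gets the HTML 500 page instead of a Solr response. A sketch of one possible fix follows; the field attributes shown are assumptions and should be checked against the rest of the schema (alternatively, the copyField line can be removed instead).]

```xml
<!-- Hypothetical fix sketch for schema.xml: declare the missing source
     field so the existing copyField directive resolves. Attribute values
     are assumptions; match the field type to one defined in your schema. -->
<field name="rawcontent" type="binary" indexed="false" stored="true"/>
```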
>
>
>
>Cheers,
>Sherban
>
>