Posted to solr-user@lucene.apache.org by "krauss@gds2.de" <kr...@gds2.de> on 2020/01/17 16:25:04 UTC
Indexing HTML Metatags Nutch - SOLR
Hello,
I have been trying to get this working for several days without success
(Nutch 1.16, Solr 7.3.1).
I have followed this description:
https://cwiki.apache.org/confluence/display/nutch/IndexMetatags
My nutch-site.xml is included below.
I have created the core following this description:
https://cwiki.apache.org/confluence/display/nutch/NutchTutorial/
By the way, without the metatags everything works fine.
Before creating the core I deleted managed-schema.xml and inserted my
metatag fields into schema.xml in the configsets directory of the core:
<field name="metatag.SITdescription" type="text_general" stored="true"
indexed="true" multiValued="true"/>
<field name="metatag.SITkeywords" type="text_general" stored="true"
indexed="true" multiValued="true"/>
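The nutch-site.xml further down lists four metatags in index.parse.md, while only two fields are declared here. A minimal sketch of the remaining two declarations, assuming SITcategory and SITintern should be indexed with the same type and attributes as the two fields above (that choice is an assumption, not something tested in this setup):

```xml
<!-- Assumed: same type and attributes as the two declared metatag fields -->
<field name="metatag.SITcategory" type="text_general" stored="true" indexed="true" multiValued="true"/>
<field name="metatag.SITintern" type="text_general" stored="true" indexed="true" multiValued="true"/>
```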
First question: after creating the core I see a managed-schema.xml file and
a schema.xml.bak file in the conf directory of the core. Sorry, I am new to
this, but I believe I do not want managed-schema.xml? (See the description
above.)
Anyway, when I run the crawl, everything is fine until the index is created.
Then I end up with this error:
org.apache.solr.common.SolrException: copyField dest :'metatag.SITdescription_str' is not an explicit field and doesn't match a dynamicField.
    at org.apache.solr.schema.IndexSchema.registerCopyField(IndexSchema.java:902)
    at org.apache.solr.schema.ManagedIndexSchema.addCopyFields(ManagedIndexSchema.java:784)
There is no copyField instruction for metatag.SITdescription in
managed-schema.xml. I even created a field "metatag.SITdescription_str" in
managed-schema.xml, which did not help.
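The error message itself names the two things Solr would accept as a copyField destination: an explicit field or a matching dynamicField. A rough sketch of both as schema entries, assuming a field type named "string" (solr.StrField) is defined in the schema; the type name and the attributes are assumptions and have not been verified against this core:

```xml
<!-- Variant 1: explicit destination field (the one already tried above) -->
<field name="metatag.SITdescription_str" type="string" stored="false" indexed="true" multiValued="true"/>

<!-- Variant 2: dynamic field matching every field name that ends in _str -->
<dynamicField name="*_str" type="string" stored="false" indexed="true" multiValued="true"/>
```

Either entry only has an effect if it sits in the schema file the core actually loads and the core has been reloaded afterwards.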
Can you help me, please?
Best Regards
Martin
nutch-site.xml
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
  <property>
    <name>http.agent.name</name>
    <value>SIT_NUTCH_SPIDER</value>
  </property>
  <property>
    <name>db.ignore.external.links</name>
    <value>true</value>
    <description>If true, outlinks leading from a page to external hosts will be
    ignored. This is an effective way to limit the crawl to include only
    initially injected hosts, without creating complex URLFilters.
    </description>
  </property>
  <property>
    <name>plugin.includes</name>
    <value>protocol-http|urlfilter-(regex|validator)|parse-(html|tika|metatags)|index-(basic|anchor|metadata)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
    <description>Regular expression naming plugin directory names to
    include. Any plugin not matching this expression is excluded.
    By default Nutch includes plugins to crawl HTML and various other
    document formats via HTTP/HTTPS and indexing the crawled content
    into Solr. More plugins are available to support more indexing
    backends, to fetch ftp:// and file:// URLs, for focused crawling,
    and many other use cases.
    </description>
  </property>
  <property>
    <name>http.robot.rules.whitelist</name>
    <value>sitlux02.sit.de</value>
    <description>Comma separated list of hostnames or IP addresses to ignore
    robot rules parsing for.
    </description>
  </property>
  <property>
    <name>metatags.names</name>
    <value>SITdescription,SITkeywords,SITcategory,SITintern</value>
    <description>Names of the metatags to extract, separated by ','.
    Use '*' to extract all metatags. Prefixes the names with 'metatag.'
    in the parse-metadata. For instance to index description and keywords,
    you need to activate the plugin index-metadata and set the value of the
    parameter 'index.parse.md' to 'metatag.description,metatag.keywords'.
    </description>
  </property>
  <property>
    <name>index.parse.md</name>
    <value>metatag.SITdescription,metatag.SITkeywords,metatag.SITcategory,metatag.SITintern</value>
    <description>
    Comma-separated list of keys to be taken from the parse metadata to
    generate fields.
    Can be used e.g. for 'description' or 'keywords' provided that these
    values are generated by a parser (see parse-metatags plugin)
    </description>
  </property>
  <property>
    <name>index.metadata</name>
    <value>metatag.SITdescription,metatag.SITkeywords,metatag.SITcategory,metatag.SITintern</value>
    <description>
    Comma-separated list of keys to be taken from the metadata to generate
    fields.
    Can be used e.g. for 'description' or 'keywords' provided that these
    values are generated by a parser (see parse-metatags plugin), and
    property 'metatags.names'.
    </description>
  </property>
</configuration>