Posted to dev@nutch.apache.org by Apache Wiki <wi...@apache.org> on 2012/04/04 16:46:07 UTC

[Nutch Wiki] Update of "IndexMetatags" by JulienNioche

The "IndexMetatags" page has been changed by JulienNioche:
http://wiki.apache.org/nutch/IndexMetatags?action=diff&rev1=1&rev2=2

  = Nutch - Parse Metatags =
- 
- '''Summary:''' When crawling HTML pages, it might be necessary to retrieve information which is stored in HTML Meta tags. This tutorial shows how to install the plugin and configure Nutch to parse meta tags into separate fields in the Solr index. Note that Nutch pushes the information to Solr, so this tutorial also includes the changes required to Solr.
+ '''Summary:''' When crawling HTML pages, it might be necessary to retrieve information which is stored in HTML Meta tags. This tutorial shows how to install the plugin and configure Nutch to parse meta tags into separate fields in the Solr index. Note that Nutch pushes the information to Solr, so this tutorial also includes the changes required to Solr. This article relates to the `parse-metatags` plugin, provided in jira:  https://issues.apache.org/jira/browse/NUTCH-809
- This article relates to the `index-metatags` plugin, provided in jira:  https://issues.apache.org/jira/browse/NUTCH-809
  
  == Plugin Information ==
+ This plugin has been committed to the trunk in revision 1303371 and will be available in Nutch 1.5. It parses specified meta tags and relies on the `index-metadata` plugin.
- This plugin has been developed as patch for Nutch 1.3. It parses specified meta tags and stores them in separate fields in the Solr Index.
- 
- == Prerequisites ==
- Solr and Nutch should already be set up. `%NUTCH_HOME%` is used as reference to your Nutch installation directory.
- 
- == Plugin Installation ==
- There are two possibilities to install this plugin: by adding the relevant jar files to an existing Nutch installation or by applying a patch to the Nutch code and building Nutch completely new. In most use cases, you only need to copy the relevant files instead of building Nutch.
- 
- '''Option 1:''' Adding the relevant files to existing Nutch
-  1. Use the zip file containing the plugin "index-metatags.zip", which is provided in Jira: https://issues.apache.org/jira/browse/NUTCH-809
-  1. Extract the zip file.
-  1. Put the folder index-metatags into `%NUTCH_HOME%/plugins`.
- 
- '''Option 2:''' Applying the patch to the code and build Nutch
-  1. Download the patch file "NUTCH-809_metatags_1.3.patch" from Jira: https://issues.apache.org/jira/browse/NUTCH-809
-  1. Download the Nutch source code from [[[here|http://nutch.apache.org/version_control.html]]].
-  1. Apply the patch to the code - there is a new plugin called `indexmetatags` available.
-  1. Build the Nutch tar by running the Ivy/Ant goals `runtime` and `tar`.
-  1. Set up Nutch.
  
  == Plugin Configuration ==
-  1. In the file `conf/nutch-site.xml`, edit the property `plugin.includes` to contain the following plugin: `|index-metatags`, so it looks like for example:{{{
+  1. In the file `conf/nutch-site.xml`, edit the property `plugin.includes` to contain the following plugins: `parse-metatags` and `index-metadata`, so that it looks like this, for example:
+ 
+  {{{
  <property>
  <name>plugin.includes</name>
+ <value>protocol-http|urlfilter-regex|parse-(html|tika|metatags)|index-(basic|anchor|metadata)|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
- <value>protocol-http|urlfilter-regex|parse-(html|tika|js|zip)|index-
- (basic|anchor|metatags)|query-(basic|site|url)|response-
- (json|xml)|summary-basic|scoring-opic|urlnormalizer-
- (pass|regex|basic)</value>
- </property>}}}
+ </property>
+ }}}
-  1. In the file `conf/nutch-site.xml`, specify which metatags should be indexed. Either specify specific metatags you want to index, or you can index all metatags. To index all, provide a '*' for the value of the property "metatags.names", otherwise provide the list of names separated by ';'. For example, to only index the metatag 'role', add the following configuration to conf/nutch-site.xml: {{{
+  1. In the file `conf/nutch-site.xml`, specify which metatags should be indexed. You can either list specific metatags or index all of them. To index all, provide a '*' for the value of the property "metatags.names"; otherwise provide the list of names separated by ';'. For example, to index only the metatags 'description' and 'keywords', add the following configuration to conf/nutch-site.xml:
+ 
+  {{{
  <!-- Used only if plugin parse-metatags is enabled. -->
  <property>
  <name>metatags.names</name>
- <value>role</value>
- <description>For plugin parse-metatags: Indicate here the name of the
- html meta tag that should be
- parsed. Use a semicolon separated list if you want multiple
- tags, or use '*' to index all.
- Example: description;keywords;role
+ <value>description;keywords</value>
+ <description> Names of the metatags to extract, separated by ';'.
+   Use '*' to extract all metatags. Prefixes the names with 'metatag.'
+   in the parse-metadata. For instance to index description and keywords,
+   you need to activate the plugin index-metadata and set the value of the
+   parameter 'index.parse.md' to 'metatag.description,metatag.keywords'.
  </description>
  </property>
  }}}
+  1. In the same file you need to configure the `index-metadata` plugin. The values are stored in the parse metadata, so we need to specify:
+ 
+  {{{
+ <property>
+   <name>index.parse.md</name>
+   <value>metatag.description,metatag.keywords</value>
+   <description>
+   Comma-separated list of keys to be taken from the parse metadata to generate fields.
+   Can be used e.g. for 'description' or 'keywords' provided that these values are generated
+   by a parser (see parse-metatags plugin)
+   </description>
+ </property>
+ }}}
+  '''CAUTION: '''the names of the fields must be prefixed with 'metatag.'
+  1. You can test that the fields are generated correctly by using the IndexingFiltersChecker.
-  1. In order to have the specified metatags indexed by Solr, edit your Solr `schema.xml` (located in `$SOLR_HOME$/conf`) and include new fields for each metatag you want to indexed. For example for the field 'role', add the following lines: {{{
+  1. In order to have the specified metatags indexed by Solr, edit your Solr `schema.xml` (located in `$SOLR_HOME$/conf`) and include a new field for each metatag you want to index. For example, for the fields 'metatag.description' and 'metatag.keywords', add the following lines:
+ 
+  {{{
  ...
  <fields>
  ....
  <!-- fields for the metatags plugin -->
+ <field name="metatag.description" type="text" stored="true" indexed="true"/>
- <field name="role" type="String" stored="true" indexed="true"/>
+ <field name="metatag.keywords" type="text" stored="true" indexed="true"/>
  ...
  </fields>
  }}}
+  '''Note''': you can use the file ''solrindex-mapping.xml'' to rename the fields, e.g. ''<field dest="description" source="metatag.description"/>''
   1. Restart Solr to load the new configuration.
-  1. Re-index your pages by running Nutch again - the metatag should be available in the Solr index. Check the index with Luke (http://code.google.com/p/luke) to see if it is available as separate field.
+  1. Re-index your pages by running Nutch again - the metatags should be available in the Solr index. Check the index with Luke (http://code.google.com/p/luke) to see if they are available as separate fields.
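
 As an illustration of the renaming mentioned in the note on ''solrindex-mapping.xml'', the relevant part of that file might look roughly like the sketch below. This is only a hedged example and not part of the wiki change itself: the destination name 'keywords' is an assumption chosen for illustration (only the 'description' mapping appears in the note), and the rest of the mapping file is assumed to stay as shipped with Nutch. If the fields are renamed this way, the Solr `schema.xml` would presumably need to declare the destination names rather than the 'metatag.'-prefixed ones.

 {{{
 <mapping>
   <fields>
     <!-- existing field mappings shipped with Nutch stay here unchanged -->
     <!-- rename the prefixed parse-metadata keys to shorter Solr field names -->
     <field dest="description" source="metatag.description"/>
     <!-- 'keywords' as destination name is an assumption for illustration -->
     <field dest="keywords" source="metatag.keywords"/>
   </fields>
   <uniqueKey>id</uniqueKey>
 </mapping>
 }}}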