Posted to dev@nutch.apache.org by "Julien Nioche (JIRA)" <ji...@apache.org> on 2010/04/02 16:16:27 UTC
[jira] Updated: (NUTCH-809) Parse-metatags plugin
[ https://issues.apache.org/jira/browse/NUTCH-809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Julien Nioche updated NUTCH-809:
--------------------------------
Attachment: NUTCH-809.patch
> Parse-metatags plugin
> ---------------------
>
> Key: NUTCH-809
> URL: https://issues.apache.org/jira/browse/NUTCH-809
> Project: Nutch
> Issue Type: New Feature
> Components: parser
> Reporter: Julien Nioche
> Assignee: Julien Nioche
> Attachments: NUTCH-809.patch
>
>
> h2. Parse-metatags plugin
> *NOTE: THIS PLUGIN DOES NOT WORK WITH THE CURRENT VERSION OF PARSE-TIKA (see [TIKA-379]).*
> To use the legacy HTML parser instead, specify the following in parse-plugins.xml:
> {code:xml}
> <mimeType name="text/html">
>   <plugin id="parse-html" />
> </mimeType>
> {code}
> The parse-metatags plugin consists of an HTMLParserFilter that takes a list of metatag names as its parameter, with '*' as the default value. The names are separated by ';'.
> To extract the values of the description and keywords metatags, specify the following in nutch-site.xml:
> {code:xml}
> <property>
>   <name>metatags.names</name>
>   <value>description;keywords</value>
> </property>
> {code}
> The MetatagIndexer uses the output of the parsing step above to create two fields, 'keywords' and 'description'. Note that 'keywords' is multivalued.
> The MetaTagsQueryFilter allows the fields above to be included in Nutch queries.
> This code has been developed by DigitalPebble Ltd and offered to the community by ANT.com.
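For readers who want to see how a value such as "description;keywords" might be interpreted, here is a minimal sketch in Java. It is not taken from NUTCH-809.patch: the class MetatagNames and its methods are hypothetical, and it assumes that an unset or blank value falls back to '*', that '*' means "accept every metatag", and that names are compared case-insensitively. The actual plugin may behave differently on any of these points.

```java
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Set;

/** Hypothetical helper for illustration only; not part of the NUTCH-809 patch. */
final class MetatagNames {
    private final Set<String> names;
    private final boolean matchAll;

    private MetatagNames(Set<String> names, boolean matchAll) {
        this.names = names;
        this.matchAll = matchAll;
    }

    /** Parses a ';'-separated list of metatag names, e.g. "description;keywords". */
    static MetatagNames parse(String value) {
        // Assumption: an unset or blank value falls back to the default '*'.
        String effective = (value == null || value.trim().isEmpty()) ? "*" : value;
        Set<String> parsed = new LinkedHashSet<>();
        boolean all = false;
        for (String token : effective.split(";")) {
            // Assumption: names are trimmed and compared case-insensitively.
            String name = token.trim().toLowerCase(Locale.ROOT);
            if (name.isEmpty()) {
                continue; // ignore empty entries such as a trailing ';'
            }
            if (name.equals("*")) {
                all = true; // wildcard: accept every metatag
            } else {
                parsed.add(name);
            }
        }
        return new MetatagNames(parsed, all);
    }

    /** Returns true if a metatag with the given name should be extracted. */
    boolean accepts(String metatagName) {
        if (metatagName == null) {
            return false;
        }
        return matchAll || names.contains(metatagName.trim().toLowerCase(Locale.ROOT));
    }
}
```

Under these assumptions, MetatagNames.parse("description;keywords") accepts "keywords" and "Description" but rejects "author", while MetatagNames.parse("*") accepts any name.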