Posted to dev@nutch.apache.org by "Julien Nioche (JIRA)" <ji...@apache.org> on 2010/04/02 16:16:27 UTC
[jira] Created: (NUTCH-809) Parse-metatags plugin
Parse-metatags plugin
---------------------
Key: NUTCH-809
URL: https://issues.apache.org/jira/browse/NUTCH-809
Project: Nutch
Issue Type: New Feature
Components: parser
Reporter: Julien Nioche
Assignee: Julien Nioche
Attachments: NUTCH-809.patch
h2. Parse-metatags plugin
*NOTE: THIS PLUGIN DOES NOT WORK WITH THE CURRENT VERSION OF PARSE-TIKA (see [TIKA-379]).*
To use the legacy HTML parser, specify the following in parse-plugins.xml:
{code:xml}
<mimeType name="text/html">
<plugin id="parse-html" />
</mimeType>
{code}
The parse-metatags plugin consists of an HTMLParserFilter that takes as a parameter a list of metatag names, with '*' as the default value. The names are separated by ';'.
To extract the values of the description and keywords metatags, specify the following in nutch-site.xml:
{code:xml}
<property>
<name>metatags.names</name>
<value>description;keywords</value>
</property>
{code}
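For comparison, here is a minimal sketch of the same property with its default value written out explicitly. It assumes that '*' acts as a wildcard selecting every metatag found on a page, which the description suggests but does not state outright. If '*' is indeed the default, leaving the property unset should have the same effect.
{code:xml}
<!-- Assumption: '*' selects all metatags rather than a tag literally named '*' -->
<property>
<name>metatags.names</name>
<value>*</value>
</property>
{code}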
The MetatagIndexer uses the output of the parsing step above to create two fields, 'keywords' and 'description'. Note that 'keywords' is multivalued.
The MetaTagsQueryFilter allows the fields above to be included in Nutch queries.
This code was developed by DigitalPebble Ltd and offered to the community by ANT.com.
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
[jira] Updated: (NUTCH-809) Parse-metatags plugin
Posted by "Julien Nioche (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/NUTCH-809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Julien Nioche updated NUTCH-809:
--------------------------------
Description:
h2. Parse-metatags plugin
The parse-metatags plugin consists of an HTMLParserFilter that takes as a parameter a list of metatag names, with '*' as the default value. The names are separated by ';'.
To extract the values of the description and keywords metatags, specify the following in nutch-site.xml:
{code:xml}
<property>
<name>metatags.names</name>
<value>description;keywords</value>
</property>
{code}
The MetatagIndexer uses the output of the parsing step above to create two fields, 'keywords' and 'description'. Note that 'keywords' is multivalued.
The query-basic plugin is used to include these fields in the search, e.g. in nutch-site.xml:
{code:xml}
<property>
<name>query.basic.description.boost</name>
<value>2.0</value>
</property>
<property>
<name>query.basic.keywords.boost</name>
<value>2.0</value>
</property>
{code}
This code was developed by DigitalPebble Ltd and offered to the community by ANT.com.
was:
h2. Parse-metatags plugin
The parse-metatags plugin consists of an HTMLParserFilter that takes as a parameter a list of metatag names, with '*' as the default value. The names are separated by ';'.
To extract the values of the description and keywords metatags, specify the following in nutch-site.xml:
{code:xml}
<property>
<name>metatags.names</name>
<value>description;keywords</value>
</property>
{code}
The MetatagIndexer uses the output of the parsing step above to create two fields, 'keywords' and 'description'. Note that 'keywords' is multivalued.
The MetaTagsQueryFilter allows the fields above to be included in Nutch queries.
This code was developed by DigitalPebble Ltd and offered to the community by ANT.com.
[jira] Updated: (NUTCH-809) Parse-metatags plugin
Posted by "Julien Nioche (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/NUTCH-809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Julien Nioche updated NUTCH-809:
--------------------------------
Description:
h2. Parse-metatags plugin
The parse-metatags plugin consists of an HTMLParserFilter that takes as a parameter a list of metatag names, with '*' as the default value. The names are separated by ';'.
To extract the values of the description and keywords metatags, specify the following in nutch-site.xml:
{code:xml}
<property>
<name>metatags.names</name>
<value>description;keywords</value>
</property>
{code}
The MetatagIndexer uses the output of the parsing step above to create two fields, 'keywords' and 'description'. Note that 'keywords' is multivalued.
The MetaTagsQueryFilter allows the fields above to be included in Nutch queries.
This code was developed by DigitalPebble Ltd and offered to the community by ANT.com.
was:
h2. Parse-metatags plugin
*NOTE: THIS PLUGIN DOES NOT WORK WITH THE CURRENT VERSION OF PARSE-TIKA (see [TIKA-379]).*
To use the legacy HTML parser, specify the following in parse-plugins.xml:
{code:xml}
<mimeType name="text/html">
<plugin id="parse-html" />
</mimeType>
{code}
The parse-metatags plugin consists of an HTMLParserFilter that takes as a parameter a list of metatag names, with '*' as the default value. The names are separated by ';'.
To extract the values of the description and keywords metatags, specify the following in nutch-site.xml:
{code:xml}
<property>
<name>metatags.names</name>
<value>description;keywords</value>
</property>
{code}
The MetatagIndexer uses the output of the parsing step above to create two fields, 'keywords' and 'description'. Note that 'keywords' is multivalued.
The MetaTagsQueryFilter allows the fields above to be included in Nutch queries.
This code was developed by DigitalPebble Ltd and offered to the community by ANT.com.
[jira] Updated: (NUTCH-809) Parse-metatags plugin
Posted by "Julien Nioche (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/NUTCH-809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Julien Nioche updated NUTCH-809:
--------------------------------
Attachment: NUTCH-809.patch
[jira] Updated: (NUTCH-809) Parse-metatags plugin
Posted by "Julien Nioche (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/NUTCH-809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Julien Nioche updated NUTCH-809:
--------------------------------
Attachment: (was: NUTCH-809.patch)
[jira] Updated: (NUTCH-809) Parse-metatags plugin
Posted by "Julien Nioche (JIRA)" <ji...@apache.org>.
[ https://issues.apache.org/jira/browse/NUTCH-809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Julien Nioche updated NUTCH-809:
--------------------------------
Attachment: NUTCH-809.patch
Modified version of the plugin that is compatible with parse-tika.