Posted to dev@nutch.apache.org by "Lewis John McGibbney (JIRA)" <ji...@apache.org> on 2011/07/16 20:49:59 UTC

[jira] [Commented] (NUTCH-62) Add html META tag information into metaData in index-more plugin

    [ https://issues.apache.org/jira/browse/NUTCH-62?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13066494#comment-13066494 ] 

Lewis John McGibbney commented on NUTCH-62:
-------------------------------------------

Various comments above create some confusion about what needs to be done to resolve this issue, or in fact what exactly the issue to be resolved is!

Is there a requirement to rework the HTMLMetaProcessor class to incorporate the suggestions above, e.g. "consistent schema in both cases..."?

Protocol.metadata aside, what we are essentially talking about is picking up all Parsedata.metadata included within meta tags, which I assume we would wish to index at a later stage. Focussing on the HTMLMetaProcessor class, we already acquire the name, http-equiv and content attributes from meta tags. Would an improvement be to configure the class to pick up other attributes not already mentioned?
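
To make the question concrete, here is a rough sketch of what "picking up other attributes" could mean. This is not the existing HTMLMetaProcessor code: the class MetaTagSketch and the method collectMetaAttributes are hypothetical names, and the sketch assumes well-formed XHTML input with lower-case tag names so that the JDK's own XML parser can be used instead of Nutch's HTML parser. It simply keeps every attribute found on each meta element rather than only name, http-equiv and content.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NamedNodeMap;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical illustration only; not part of Nutch.
class MetaTagSketch {

    /**
     * Returns one attribute map per meta element, in document order.
     * Assumes the input is well-formed XHTML with lower-case tag names.
     */
    static List<Map<String, String>> collectMetaAttributes(String xhtml) {
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xhtml)));

            List<Map<String, String>> result = new ArrayList<>();
            NodeList metas = doc.getElementsByTagName("meta");
            for (int i = 0; i < metas.getLength(); i++) {
                NamedNodeMap attrs = metas.item(i).getAttributes();
                Map<String, String> tag = new LinkedHashMap<>();
                // Keep every attribute, not just name, http-equiv and content.
                for (int j = 0; j < attrs.getLength(); j++) {
                    Node attr = attrs.item(j);
                    tag.put(attr.getNodeName().toLowerCase(Locale.ROOT),
                            attr.getNodeValue());
                }
                result.add(tag);
            }
            return result;
        } catch (Exception e) {
            // Malformed input or parser setup failure.
            throw new IllegalArgumentException("Could not parse input as XHTML", e);
        }
    }

    public static void main(String[] args) {
        String page = "<html><head>"
            + "<meta name=\"keywords\" content=\"nutch,search\"/>"
            + "<meta http-equiv=\"refresh\" content=\"5\"/>"
            + "<meta name=\"description\" content=\"demo\" lang=\"en\"/>"
            + "</head><body/></html>";
        for (Map<String, String> tag : collectMetaAttributes(page)) {
            System.out.println(tag);
        }
    }
}
```

Whether extra attributes such as lang are worth carrying through to the index is exactly the open question; the sketch only shows that collecting them is cheap once the meta elements are being visited anyway.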

> Add html META tag information into metaData in index-more plugin
> ----------------------------------------------------------------
>
>                 Key: NUTCH-62
>                 URL: https://issues.apache.org/jira/browse/NUTCH-62
>             Project: Nutch
>          Issue Type: Improvement
>          Components: indexer
>            Reporter: Jack Tang
>            Priority: Trivial
>         Attachments: index-more.patch.zip
>
>
> Now (version dev-0.7), only some metaData in the HTTP response, such as type, date and content-length, is available in the index-more plugin. We cannot index/store the metadata in the HTML header (<META> tags, to be exact).

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira