Posted to dev@nutch.apache.org by "Tim Allison (JIRA)" <ji...@apache.org> on 2018/07/13 11:18:00 UTC

[jira] [Commented] (NUTCH-2586) Add a fallback mechanism for missing meta tags

    [ https://issues.apache.org/jira/browse/NUTCH-2586?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16542898#comment-16542898 ] 

Tim Allison commented on NUTCH-2586:
------------------------------------

Is this better handled at the Tika level...or is this something we should also add to Tika?

> Add a fallback mechanism for missing meta tags
> ----------------------------------------------
>
>                 Key: NUTCH-2586
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2586
>             Project: Nutch
>          Issue Type: New Feature
>            Reporter: Gerard Bouchar
>            Priority: Major
>
> While using Nutch, we ran into the following issue: some web pages lack a "description" meta tag but include an "og:description" meta tag (using the [Open Graph protocol|http://ogp.me/]).
> Here are two examples: 
> * http://imagenesdelavirgenmaria.com/17-imagenes-de-la-virgen-maria-de-guadalupe/
> * http://mixcdsource.com/product/dj-arson-dj-sin-cerothe-hit-list-18-5-reggaeton-edition/
> It would be nice to have a configurable list of fallback meta tags to use when the main meta tag is absent: something that would let us specify, in the configuration, "when the 'description' meta is missing, use 'og:description'; when 'title' is missing, use 'og:title'; etc."
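
To make the request above concrete, here is a minimal sketch of what such a fallback lookup could look like. It is an illustration only: the class name MetaTagFallback, the methods parse and resolve, and the "name=alt1,alt2;name2=alt3" rule syntax are all hypothetical and are not part of Nutch or Tika. It assumes the page's meta tags have already been extracted into a plain name-to-content map.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper for illustration; not an existing Nutch or Tika class.
final class MetaTagFallback {

    // Parses an assumed rule syntax such as
    // "description=og:description,twitter:description;title=og:title"
    // into a map from primary tag name to its ordered fallback names.
    static Map<String, List<String>> parse(String spec) {
        Map<String, List<String>> rules = new LinkedHashMap<>();
        for (String rule : spec.split(";")) {
            String[] parts = rule.split("=", 2);
            if (parts.length != 2) {
                continue; // skip empty or malformed entries
            }
            List<String> alternatives = new ArrayList<>();
            for (String alt : parts[1].split(",")) {
                if (!alt.trim().isEmpty()) {
                    alternatives.add(alt.trim());
                }
            }
            rules.put(parts[0].trim(), alternatives);
        }
        return rules;
    }

    // Returns the primary tag's value if present and non-blank; otherwise the
    // first non-blank fallback value in configured order; otherwise null.
    static String resolve(Map<String, String> metaTags,
                          Map<String, List<String>> rules,
                          String name) {
        String value = metaTags.get(name);
        if (value != null && !value.trim().isEmpty()) {
            return value;
        }
        for (String alt : rules.getOrDefault(name, Collections.<String>emptyList())) {
            String candidate = metaTags.get(alt);
            if (candidate != null && !candidate.trim().isEmpty()) {
                return candidate;
            }
        }
        return null;
    }
}
```

With the rules "description=og:description;title=og:title", a page that carries only an "og:description" meta tag would resolve "description" to the Open Graph value, while a page that has both would keep its own "description". Whether this logic belongs in a Nutch plugin or in Tika is exactly the open question in this thread; the sketch takes no position on that.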


