Posted to dev@nutch.apache.org by "Felix Zett (JIRA)" <ji...@apache.org> on 2016/09/30 08:49:20 UTC

[jira] [Commented] (NUTCH-966) Behavior of NOINDEX,FOLLOW is not intuitive

    [ https://issues.apache.org/jira/browse/NUTCH-966?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15535447#comment-15535447 ] 

Felix Zett commented on NUTCH-966:
----------------------------------

Should a document really be left out of the index just because it has no text and no title? Important content may have been written to other fields by parse filters.

I think this solution is cleaner:
https://github.com/saintybalboa/nutchmetarobots

It follows the same idea, but discards a document only when a noindex directive is encountered.
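
To make the difference between the two rules concrete, here is a minimal sketch in plain Python. It is not Nutch code and not the code of the plugin linked above; the dictionary keys ("text", "title", "robots", "custom_field") and the function names are hypothetical and chosen only for illustration.

```python
def has_noindex(robots_content):
    """Return True if a robots meta value lists the noindex directive."""
    if not robots_content:
        return False
    # Directives are comma-separated and case-insensitive, e.g. "NOINDEX,FOLLOW".
    directives = {d.strip().lower() for d in robots_content.split(",")}
    return "noindex" in directives


def should_discard(doc):
    """Rule suggested here: discard only on an explicit noindex directive."""
    return has_noindex(doc.get("robots"))


def should_discard_if_empty(doc):
    """Rule being questioned: discard any document with no text and no title."""
    return not doc.get("text") and not doc.get("title")


# Hypothetical document whose content a parse filter moved into another field.
filtered = {"text": "", "title": "", "custom_field": "important content"}
# Hypothetical document that opts out of indexing but allows link following.
opted_out = {"text": "body", "title": "Title", "robots": "NOINDEX,FOLLOW"}

for name, doc in (("filtered", filtered), ("opted_out", opted_out)):
    print(name, should_discard(doc), should_discard_if_empty(doc))
```

Under these assumptions, the empty-text rule drops the first document even though it carries content in another field, while the directive-based rule drops only the second one.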

> Behavior of NOINDEX,FOLLOW is not intuitive
> -------------------------------------------
>
>                 Key: NUTCH-966
>                 URL: https://issues.apache.org/jira/browse/NUTCH-966
>             Project: Nutch
>          Issue Type: Improvement
>          Components: indexer, parser
>    Affects Versions: 1.2
>            Reporter: Josh Pavel
>            Priority: Minor
>             Fix For: 2.5
>
>
> If a page has NOINDEX,FOLLOW for the ROBOTS metatag, Nutch will still create a document that can be found in the index via metatag or URL matching.  Instead, Nutch should rely on doc or parse metadata but nothing should be stored by the html parser. (thanks to Julien Nioche for helping me to understand the issue). 


