Posted to dev@nutch.apache.org by "Hudson (Jira)" <ji...@apache.org> on 2020/08/16 21:08:01 UTC

[jira] [Commented] (NUTCH-2720) ROBOTS metatag ignored when capitalized

    [ https://issues.apache.org/jira/browse/NUTCH-2720?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17178603#comment-17178603 ] 

Hudson commented on NUTCH-2720:
-------------------------------

SUCCESS: Integrated in Jenkins build Nutch » Nutch-trunk #3 (See [https://ci-builds.apache.org/job/Nutch/job/Nutch-trunk/3/])
NUTCH-2720 ROBOTS metatag ignored when capitalized (snagel: [https://github.com/apache/nutch/commit/508715175ad3a5cb7454f4734bb6dc870d80e7d1])
* (edit) src/plugin/parse-tika/src/java/org/apache/nutch/parse/tika/TikaParser.java
NUTCH-2720 ROBOTS metatag ignored when capitalized (snagel: [https://github.com/apache/nutch/commit/fa319a60f30dbb0efcd67e306c611d66b7b379f1])
* (edit) src/java/org/apache/nutch/indexer/IndexerMapReduce.java
* (edit) src/plugin/parse-html/src/java/org/apache/nutch/parse/html/HTMLMetaProcessor.java
* (edit) src/plugin/parse-tika/src/java/org/apache/nutch/parse/tika/TikaParser.java
* (edit) src/java/org/apache/nutch/metadata/Nutch.java
* (edit) src/plugin/parse-tika/src/java/org/apache/nutch/parse/tika/HTMLMetaProcessor.java


> ROBOTS metatag ignored when capitalized
> ---------------------------------------
>
>                 Key: NUTCH-2720
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2720
>             Project: Nutch
>          Issue Type: Bug
>          Components: indexer, robots
>    Affects Versions: 1.15
>            Reporter: Felix Zett
>            Assignee: Sebastian Nagel
>            Priority: Minor
>             Fix For: 1.17
>
>         Attachments: noindex.html
>
>
> As discussed [on the mailing list|https://www.mail-archive.com/user@nutch.apache.org/msg16516.html], index-metadata fails to ignore a webpage with a capitalized robots metatag such as {{<META NAME="ROBOTS" CONTENT="NOINDEX, FOLLOW">}}. This only applies when parse-tika is used; parse-html will "decapitalize" the metatag.
> Parsing the attached [^noindex.html] leads to the following results:
> *parse-html:*
> {code:java}
> bin/nutch parsechecker -Dplugin.includes="protocol-httpclient|parse-(html|metatags)|index-metadata" -Dindexer.delete.robots.noindex="true" -Dmetatags.names="robots" -Dindex.parse.md="metatag.robots" http://localhost:8080/noindex.html
> Parse Metadata: [...] metatag.robots=noindex,nofollow robots=noindex,nofollow{code}
> *parse-tika:*
> {code:java}
> bin/nutch parsechecker -Dplugin.includes="protocol-httpclient|parse-(tika|metatags)|index-metadata" -Dindexer.delete.robots.noindex="true" -Dmetatags.names="robots" -Dindex.parse.md="metatag.robots" http://localhost:8080/noindex.html
> Parse Metadata: metatag.robots=NOINDEX,NOFOLLOW  [...] ROBOTS=NOINDEX,NOFOLLOW [...]{code}
>  
> The field being named "ROBOTS" and not "robots" leads to {{parseData.getMeta("robots")}} being {{null}} in [https://github.com/apache/nutch/blob/master/src/java/org/apache/nutch/indexer/IndexerMapReduce.java#L257].
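
For illustration, the failure mode described in the report can be reproduced without Nutch at all: a case-sensitive key lookup misses "ROBOTS" when asked for "robots". The following is a minimal sketch using a plain java.util.Map as a stand-in for the parse metadata. The class RobotsMetaLookup and its methods normalizeKeys and isNoIndex are hypothetical names invented for this example; this is not the code from the commits listed above, only one possible way to make such a lookup case-insensitive.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical stand-alone demo; not part of Nutch.
public class RobotsMetaLookup {

    // Copy the metadata into a new map with lowercased keys, so that
    // "ROBOTS", "Robots" and "robots" all end up under "robots".
    static Map<String, String> normalizeKeys(Map<String, String> raw) {
        Map<String, String> normalized = new HashMap<>();
        for (Map.Entry<String, String> e : raw.entrySet()) {
            normalized.put(e.getKey().toLowerCase(Locale.ROOT), e.getValue());
        }
        return normalized;
    }

    // True if the "robots" entry exists and contains "noindex",
    // compared case-insensitively. Assumes keys are already lowercased.
    static boolean isNoIndex(Map<String, String> meta) {
        String robots = meta.get("robots");
        return robots != null
                && robots.toLowerCase(Locale.ROOT).contains("noindex");
    }

    public static void main(String[] args) {
        // Metadata as reported for parse-tika: upper-case name and value.
        Map<String, String> raw = new HashMap<>();
        raw.put("ROBOTS", "NOINDEX,NOFOLLOW");

        // The case-sensitive lookup misses the entry.
        System.out.println("raw lookup: " + raw.get("robots"));

        // After normalizing the keys, the noindex directive is detected.
        System.out.println("noindex after normalizing: "
                + isNoIndex(normalizeKeys(raw)));
    }
}
```

Lowercasing with Locale.ROOT avoids locale-dependent surprises (for example the Turkish dotless i) when normalizing names and values.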



--
This message was sent by Atlassian Jira
(v8.3.4#803005)