Posted to dev@nutch.apache.org by "Sebastian Nagel (JIRA)" <ji...@apache.org> on 2018/06/28 11:00:00 UTC
[jira] [Updated] (NUTCH-2387) Nutch should not index document with "noindex" meta
[ https://issues.apache.org/jira/browse/NUTCH-2387?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Sebastian Nagel updated NUTCH-2387:
-----------------------------------
Fix Version/s: 1.16 (was: 1.15)
> Nutch should not index document with "noindex" meta
> ---------------------------------------------------
>
> Key: NUTCH-2387
> URL: https://issues.apache.org/jira/browse/NUTCH-2387
> Project: Nutch
> Issue Type: Bug
> Components: indexer
> Affects Versions: 1.13
> Environment: Linux Mint 18
> Reporter: Eyeris Rodriguez Rueda
> Priority: Major
> Labels: index, meta, robots
> Fix For: 1.16
>
>
> I'm using Nutch 1.12 in local mode and Solr 4.10.3.
> I have found that Nutch indexes documents that carry a "noindex" robots meta tag.
> I used the Nutch crawl script for a complete cycle:
> bin/crawl -i urls/ crawl/ -2
> with this URL:
> https://humanos.uci.cu/category/humanos/comparte-tu-software/page/3/
> After repeated tests the problem persists, and approximately 200 documents with this robots meta tag are indexed incorrectly.
> I have read the configure method in the IndexerMapReduce.java class, and it has a line for that property, but for some reason it is not working as expected:
> this.deleteRobotsNoIndex = job.getBoolean(INDEXER_DELETE_ROBOTS_NOINDEX,false); (line 97)
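The quoted line reads the flag with a default of false, so the noindex handling presumably stays off unless the property is set explicitly. The following is a minimal sketch of that lookup behaviour only, not the Nutch implementation: it uses java.util.Properties as a simplified stand-in for the job configuration, the class name NoIndexFlagSketch is hypothetical, and the property name "indexer.delete.robots.noindex" is an assumption about what the INDEXER_DELETE_ROBOTS_NOINDEX constant resolves to.

```java
import java.util.Properties;

public class NoIndexFlagSketch {
    // Assumed property name behind INDEXER_DELETE_ROBOTS_NOINDEX;
    // verify it against the Nutch version in use.
    public static final String INDEXER_DELETE_ROBOTS_NOINDEX =
        "indexer.delete.robots.noindex";

    // Simplified stand-in for a getBoolean(name, default) lookup:
    // returns the default when the key is not set at all.
    public static boolean getBoolean(Properties conf, String name,
                                     boolean defaultValue) {
        String value = conf.getProperty(name);
        if (value == null) {
            return defaultValue;
        }
        return Boolean.parseBoolean(value.trim());
    }

    public static void main(String[] args) {
        Properties conf = new Properties();

        // Property never set: the default (false) wins.
        boolean unset = getBoolean(conf, INDEXER_DELETE_ROBOTS_NOINDEX, false);

        // Property set explicitly: the configured value wins.
        conf.setProperty(INDEXER_DELETE_ROBOTS_NOINDEX, "true");
        boolean enabled = getBoolean(conf, INDEXER_DELETE_ROBOTS_NOINDEX, false);

        System.out.println("unset -> " + unset);
        System.out.println("set to true -> " + enabled);
    }
}
```

If the assumed property name is correct, one thing worth checking in a setup like the reporter's is whether that property is enabled in conf/nutch-site.xml; with it unset, the lookup above falls back to false.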
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)