Posted to user@nutch.apache.org by Eyeris Rodriguez Rueda <er...@uci.cu> on 2017/05/10 19:00:27 UTC

problems with documents with noindex meta

Hi all.
I need some help with this problem; sorry if it is a trivial thing.
I have a small problem: some URLs that have a noindex meta tag are being indexed.

For example, this URL:
https://humanos.uci.cu/category/humanos/comparte-tu-software/page/3/

has the noindex meta tag, but for some reason it is not deleted:
<meta name="robots" content="noindex,follow"/>

I have read that Nutch should delete this document at indexing time, but that is not happening, even though this property is set:

<property>
  <name>indexer.delete.robots.noindex</name>
  <value>true</value>
</property>

If I run parsechecker, the output has empty content, but the document is not deleted:

fetching: https://humanos.uci.cu/category/humanos/comparte-tu-software/page/3/
robots.txt whitelist not configured.
parsing: https://humanos.uci.cu/category/humanos/comparte-tu-software/page/3/
contentType: text/html
date :	Wed May 10 14:21:36 CDT 2017
agent :	cubbot
type :	text/html
type :	text
type :	html
title :	3
url :	https://humanos.uci.cu/category/humanos/comparte-tu-software/page/3/
content :	
tstamp :	Wed May 10 14:21:36 CDT 2017
domain :	uci.cu
digest :	25ed6b1b7be4cbb69a3405f5efe2f8a2
host :	humanos.uci.cu
name :	3
id :	https://humanos.uci.cu/category/humanos/comparte-tu-software/page/3/
lang :	es

Any help or suggestion will be appreciated.


Re: [MASSMAIL]Re: problems with documents with noindex meta

Posted by Sebastian Nagel <wa...@googlemail.com>.
> I have opened a jira for that.
>
> https://issues.apache.org/jira/browse/NUTCH-2387

Thanks.

Strictly speaking, "not indexing" and "deleting" are different things.
Of course, if "indexer.delete.robots.noindex" is false during even a single
indexing run, documents with robots=noindex make it into the index.

> Do you think the responsibility for deleting documents with the noindex robots meta
> belongs to the mapreduce class or to indexing filters (like index-basic or index-more)?

I think it's the responsibility of both:
- IndexingJob / IndexerMapReduce and
- "indexer" plugins (implementing IndexWriter).
But there may be indexer plugins which do not support deletion of documents.
An "indexing filter" only adds index fields to indexed documents.

Best,
Sebastian


Re: [MASSMAIL]Re: problems with documents with noindex meta

Posted by Eyeris Rodriguez Rueda <er...@uci.cu>.
Thanks Sebastian.

I have opened a jira for that.

https://issues.apache.org/jira/browse/NUTCH-2387

Do you think the responsibility for deleting documents with the noindex robots meta belongs to the mapreduce class or to indexing filters (like index-basic or index-more)?


Re: [MASSMAIL]Re: problems with documents with noindex meta

Posted by Sebastian Nagel <wa...@googlemail.com>.
Hi,

sorry for the late answer...

> I have tested bin/nutch parsechecker and indexchecker with doIndex=true, but the problem persists.

That's expected as indexchecker does not support deletion by robots meta.
Could you open a Jira issue to fix this? Thanks!

> It looks like Nutch never reads the property indexer.delete.robots.noindex from nutch-site.xml

The indexer job (IndexerMapReduce.java) does ...

> I have read the configure method in the IndexerMapReduce.java class, and it has a line for that
> property, but I don't understand why those documents are indexed.
>
> this.deleteRobotsNoIndex = job.getBoolean(INDEXER_DELETE_ROBOTS_NOINDEX,false);   (line 97)

OK, and it should work (tested with 1.13-SNAPSHOT):

% cat > urls.txt
https://humanos.uci.cu/category/humanos/comparte-tu-software/page/3/
^C

% nutch inject crawldb urls.txt
...
Injector: Total new urls injected: 1
Injector: finished at 2017-05-18 17:31:16, elapsed: 00:00:01

% nutch generate crawldb segments
...

% nutch fetch segments/20170518173127
...
fetching https://humanos.uci.cu/category/humanos/comparte-tu-software/page/3/ (queue crawl delay=5000ms)
...
Fetcher: finished at 2017-05-18 17:31:42, elapsed: 00:00:07

% nutch parse segments/20170518173127
...

% nutch updatedb crawldb/ segments/20170518173127
...

% nutch index -Dindexer.delete.robots.noindex=true \
    -Dplugin.includes=indexer-dummy -Ddummy.path=index.txt \
     crawldb/ segments/20170518173127/ -deleteGone
Segment dir is complete: segments/20170518173127.
Indexer: starting at 2017-05-18 17:38:52
Indexer: deleting gone documents: true
Indexer: URL filtering: false
Indexer: URL normalizing: false
Active IndexWriters :
DummyIndexWriter
        dummy.path : Path of the file to write to (mandatory)


Indexer: number of documents indexed, deleted, or skipped:
Indexer:      1  deleted (robots=noindex)
Indexer: finished at 2017-05-18 17:38:53, elapsed: 00:00:01

% cat index.txt
delete  https://humanos.uci.cu/category/humanos/comparte-tu-software/page/3/


Did you use -Dindexer.delete.robots.noindex=true in combination with -deleteGone?
Otherwise no "delete" actions are performed.
That's not really clear, and it is also not handled the same way by all indexer plugins:
indexer-solr does not delete without -deleteGone, but indexer-elastic does.
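
In other words, whether a robots=noindex delete actually reaches the backend
depends on the property, the flag, and the plugin. A small illustrative sketch
of that gating (made-up method, not actual plugin code):

public class DeleteGateSketch {

  // Illustrative only: models the behaviour described above.
  static boolean deleteReachesBackend(boolean robotsNoIndexProperty,
                                      boolean deleteGoneFlag,
                                      boolean pluginDeletesWithoutFlag) {
    if (!robotsNoIndexProperty) {
      return false;  // the document is never even marked for deletion
    }
    // indexer-solr behaves like "needs -deleteGone";
    // indexer-elastic reportedly deletes without it.
    return deleteGoneFlag || pluginDeletesWithoutFlag;
  }

  public static void main(String[] args) {
    System.out.println(deleteReachesBackend(true, true,  false)); // property + -deleteGone -> true
    System.out.println(deleteReachesBackend(true, false, false)); // property only, solr-like -> false
    System.out.println(deleteReachesBackend(true, false, true));  // property only, elastic-like -> true
  }
}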


Best,
Sebastian




Re: [MASSMAIL]Re: problems with documents with noindex meta

Posted by Eyeris Rodriguez Rueda <er...@uci.cu>.
Thanks, Sebastian, for your answer.

This is my environment:
I am using Nutch 1.12 and Solr 4.10.3 in local mode, and I always use the bin/crawl command for a complete cycle.
For some reason, all documents with the noindex meta tag are being indexed.

I have tested bin/nutch index, and the documents are indexed.

I have tested bin/nutch parsechecker and indexchecker with doIndex=true, but the problem persists.

It looks like Nutch never reads the property indexer.delete.robots.noindex from nutch-site.xml.

I have read the configure method in the IndexerMapReduce.java class, and it has a line for that property, but I don't understand why those documents are indexed:

this.deleteRobotsNoIndex = job.getBoolean(INDEXER_DELETE_ROBOTS_NOINDEX,false);   (line 97)


I really want to solve this situation; any advice or suggestion will be appreciated.


Re: problems with documents with noindex meta

Posted by Sebastian Nagel <wa...@googlemail.com>.
Hi,

the indexing job ("bin/nutch index") will delete this document.
But it looks like "bin/nutch indexchecker -DdoIndex=true" does not
(cf. NUTCH-1758).

Please note that "bin/nutch parsechecker" or "indexchecker" without "doIndex"
will not send anything to the index.

Best,
Sebastian

