Posted to user@nutch.apache.org by Ben Vachon <bv...@attivio.com> on 2017/05/15 19:35:35 UTC

delete STATUS_GONE pages from index

Hi all,

I'm working with Nutch 2.3.1 and I have a problem that I'm hoping the 
community can help me with.
A page is fetched successfully and subsequently indexed during the 
initial run of a crawler, but later, the page no longer exists on the 
server (404 not found). When I run the crawler again to update the 
index, I would like my IndexWriter to delete the document for this page.
My IndexWriter already has the code for this, but a page that fails to 
fetch is never parsed, so it never reaches my IndexFilters, let alone 
the IndexWriter.
The page is ignored instead of being deleted.
Any tips for handling this?

Thanks,

Ben V.


Re: delete STATUS_GONE pages from index

Posted by Ben Vachon <bv...@attivio.com>.
Thanks Tom,

I don't see that property in my nutch-default.xml. I think it's probably 
from an older version.

I'm just going to write a utility method that queries the Gora store for 
pages marked STATUS_GONE and deletes the matching documents from the index.
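
A minimal sketch of that cleanup idea, not the actual utility: the page 
store is modelled as a plain Map from URL to status, and the index as a 
small callback interface. The names GoneCleanup, Status, IndexDeleter and 
purgeGone are all hypothetical and are not part of Nutch or Gora; a real 
version would read the statuses from the Gora store and call the index 
backend's own delete operation.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class GoneCleanup {

    /** Hypothetical stand-in for the fetch status stored with each page. */
    enum Status { UNFETCHED, FETCHED, GONE }

    /** Hypothetical stand-in for the index backend's delete-by-URL call. */
    interface IndexDeleter {
        void delete(String url);
    }

    /**
     * Walks every stored page, asks the index to delete each one whose
     * status is GONE, and returns the URLs that were deleted.
     */
    static List<String> purgeGone(Map<String, Status> pages, IndexDeleter index) {
        List<String> deleted = new ArrayList<>();
        for (Map.Entry<String, Status> entry : pages.entrySet()) {
            if (entry.getValue() == Status.GONE) {
                index.delete(entry.getKey());
                deleted.add(entry.getKey());
            }
        }
        return deleted;
    }

    public static void main(String[] args) {
        // In-memory stand-in for the page store (assumption for illustration).
        Map<String, Status> pages = new LinkedHashMap<>();
        pages.put("http://example.com/still-here.html", Status.FETCHED);
        pages.put("http://example.com/removed.html", Status.GONE);

        // The "index" here only logs the delete request.
        List<String> deleted =
            purgeGone(pages, url -> System.out.println("delete " + url));
        System.out.println("deleted " + deleted.size() + " document(s)");
    }
}
```

Keeping the status check separate from the delete call means the same 
loop can be exercised against an in-memory map before it is pointed at a 
real store and index.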


On 05/16/2017 04:12 AM, Tom Chiverton wrote:
> Do you need to set
>
> db.update.purge.404=true
>
> ?
>
> Tom


Re: delete STATUS_GONE pages from index

Posted by Tom Chiverton <tc...@extravision.com>.
Do you need to set

db.update.purge.404=true

?

Tom

