Posted to user@nutch.apache.org by Louis Keeble <lk...@yahoo.com> on 2014/05/12 20:36:51 UTC

Nutch with elasticsearch plugin not removing a deleted doc from the elasticsearch index




Hi all, 


I am using the elasticsearch nutch indexing plugin to index a web site and add search info to an elasticsearch index. It has been working well so far. As a test, I removed a single document from a previously indexed web site and re-ran the nutch crawler on it. The web site correctly returned an HTTP 404 (not found) status for the deleted document when it was fetched by nutch. The crawl seemed to finish successfully BUT the deleted document is still showing up in the elasticsearch index. I expected/hoped it would be deleted from the index.

Does anyone have any idea why the deleted (from web site) document is not being deleted from the Elasticsearch index? 


Here's what I see for this document when I dump nutch's most recent segment data:

Recno:: 136
URL::
 http://dahl/pages/Jim_Bloggs_School/_a_gust_vis_form

CrawlDatum::
Version: 7
Status: 2 (db_fetched)
Fetch time: Fri May 09 11:49:53 CDT 2014
Modified time: Wed Dec 31 18:00:00 CST 1969
Retries since fetch: 0
Retry interval: 3600 seconds (0 days)
Score: 0.006134969
Signature: ba168f4ecf34ccbb1adea384a5f5a78d
Metadata:
        _ngt_=1399668503370
        Content-Type=text/html
        _pst_=success(1), lastModified=0
        _rs_=154

CrawlDatum::
Version: 7
Status: 37 (fetch_gone)
Fetch time: Fri May 09 15:49:33 CDT 2014
Modified time: Wed Dec 31 18:00:00 CST 1969
Retries since fetch: 0
Retry interval: 3600 seconds (0 days)
Score: 0.006134969
Signature: ba168f4ecf34ccbb1adea384a5f5a78d
Metadata:
        _ngt_=1399668503370
        Content-Type=text/html
        _pst_=notfound(14), lastModified=0: http://dahl/pages/Jim_Bloggs_School/_a_gust_vis_form
        _rs_=6748


I see that there are two "CrawlDatum" records: one has a status of 2 (db_fetched) and the other has a status of 37 (fetch_gone). Based on hadoop.log, the indexing data looked like it was sent to elasticsearch successfully, but hadoop.log doesn't provide much information for elasticsearch.
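For anyone wanting to reproduce this kind of dump, Nutch 1.x ships a readseg tool; a sketch follows (the segment directory name here is a made-up placeholder, not the poster's actual segment):

```shell
# Dump a fetched segment to a local directory for inspection.
# Replace the segment name with an actual directory under crawl/segments/.
bin/nutch readseg -dump crawl/segments/20140509114953 segdump

# The readable dump (CrawlDatum records, content, parse data) lands in segdump/dump:
less segdump/dump
```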
Thanks,


 
-Lou

Re: Nutch with elasticsearch plugin not removing a deleted doc from the elasticsearch index

Posted by Louis Keeble <lk...@yahoo.com>.
Just a follow-up to my previous post.

I have confirmed that setting the following parameter in nutch-site.xml causes documents deleted from a web site (returning HTTP 404) to *not* be removed from ElasticSearch on a re-crawl:

db.update.purge.404 = "true"
-Lou



Re: Nutch with elasticsearch plugin not removing a deleted doc from the elasticsearch index

Posted by Louis Keeble <lk...@yahoo.com>.
Thanks for the suggestion Julien. 


I made some progress on this problem :) 


I had been running with the following non-default settings in my nutch-site.xml configuration:

link.delete.gone = true
db.update.purge.404 = true

Both of these default to false in nutch-default.xml. I had set them believing that they would ensure that documents deleted from my web site would then be deleted from ElasticSearch. 

It seems the opposite is true...


As a test, I removed those two settings from nutch-site.xml. After doing this, deleted (HTTP 404) documents are now removed from ElasticSearch on a re-crawl.

I have been running the all-in-one bin/crawl script; I did not need to change anything there. 

I noticed that inside bin/crawl the bin/nutch index command is called as follows:

 $bin/nutch index $CRAWL_PATH/crawldb -linkdb $CRAWL_PATH/linkdb $CRAWL_PATH/segments/$SEGMENT

At first I did try adding the -deleteGone parameter (modifying the crawl script) but found that this did not fix the issue.

As I mentioned above, removing those two parameters solved the issue for me.

My theory as to what is happening: when db.update.purge.404 is set to true, "gone" documents are purged from the crawldb. Then, when the update to ElasticSearch is done, there is no longer any record of the "gone" documents, and therefore they are not removed from ElasticSearch. I suspect that the link.delete.gone = true setting has no effect on the deletion of documents from ElasticSearch, but I haven't verified this yet. 
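If that theory holds, a sketch of a setup that should delete gone pages from the index while still keeping their db_gone record in the crawldb (i.e. leaving db.update.purge.404 at its default of false) would be to pass -deleteGone at the index step, or to run Nutch's separate clean job afterwards. The command forms below assume a standard Nutch 1.9 bin/ layout and the same $CRAWL_PATH/$SEGMENT variables used in bin/crawl:

```shell
# Index step with deletion of gone (404) pages enabled:
bin/nutch index $CRAWL_PATH/crawldb -linkdb $CRAWL_PATH/linkdb \
  $CRAWL_PATH/segments/$SEGMENT -deleteGone

# Alternatively, purge documents marked db_gone from the index
# as a separate step (CleaningJob):
bin/nutch clean $CRAWL_PATH/crawldb
```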

Thanks for your help,

 
-Lou



Re: Nutch with elasticsearch plugin not removing a deleted doc from the elasticsearch index

Posted by Julien Nioche <li...@gmail.com>.
Hi Lou

Where would I put the -deleteGone parameter? Just add another parameter to
> the end ?  I saw online that -deleteGone is a valid parameter of the
> bin/nutch command but am not sure about bin/crawl. Maybe I need to run
> bin/nutch for this?


Just modify the crawl script and add the -deleteGone parameter to the index step:

  $bin/nutch index -D solr.server.url=$SOLRURL $CRAWL_PATH/crawldb -linkdb \
    $CRAWL_PATH/linkdb $CRAWL_PATH/segments/$SEGMENT -deleteGone

HTH

Julien








-- 

Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble

Re: Nutch with elasticsearch plugin not removing a deleted doc from the elasticsearch index

Posted by Louis Keeble <lk...@yahoo.com>.
Hi Julien, 



I am using Nutch 1.9 with ElasticSearch 0.9. (via the plugin)


When the site is initially indexed I see that the crawldb contains the soon-to-be-deleted URL as follows:

http://dahl/pages/Jim_Bloggs_School/to_be_deleted_aardvark      Version: 7
Status: 2 (db_fetched)
Fetch time: Tue May 20 11:49:53 CDT 2014
Modified time: Wed Dec 31 18:00:00 CST 1969
Retries since fetch: 0
Retry interval: 900 seconds (0 days)
Score: 0.0060975607
Signature: ecad1fb13879445c07ca3c9b302077a4
Metadata:
        Content-Type=text/html
        _pst_=success(1), lastModified=0
        _rs_=478


When I delete the document and rerun the crawl I no longer see any record of the deleted document in the new crawldb dump.
However the document remains in the ElasticSearch index.

Note: the method used is to generate a web page with only the documents that I want to be indexed and use that page as the initial seed page.
So, the second time the crawler runs, it sees a seed page *without* my deleted document.
However, I assume that the retry interval of 900 seconds (see crawldb record above) means that nutch will try to refetch the deleted document, at which point it will get an HTTP 404 (not found) from the web server. 


I have the following settings in my nutch-site.xml file (among other settings):


    "link.delete.gone"    :  "true",
     "db.update.purge.404"          : "true"
(Don't worry about the non-XML formatting, this is JSON but it gets translated to XML during a pre-processing step).
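For reference, the pre-processing step presumably expands each entry into the standard Hadoop-style property element in nutch-site.xml, roughly like this (a sketch of the conventional format, not the poster's actual generated file):

```
<property>
  <name>link.delete.gone</name>
  <value>true</value>
</property>
<property>
  <name>db.update.purge.404</name>
  <value>true</value>
</property>
```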


** I am not currently using the -deleteGone parameter anywhere.  **

I am using the bin/crawl all-in-one script, something like this:

bin/crawl <seed_url_folder> <path_to_crawldb_files> -depth 2 -topN 10000


Where would I put the -deleteGone parameter? Just add another parameter to the end? I saw online that -deleteGone is a valid parameter of the bin/nutch command, but I'm not sure about bin/crawl. Maybe I need to run bin/nutch for this?


Thanks for your help!


 
-Lou



Re: Nutch with elasticsearch plugin not removing a deleted doc from the elasticsearch index

Posted by Julien Nioche <li...@gmail.com>.
Hi Louis

What do you get in the crawldb for that URL? Which version of Nutch are you
using?

The indexer takes a -deleteGone parameter, are you using it?

Julien








-- 

Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble