Posted to dev@nutch.apache.org by "Tim Pease (JIRA)" <ji...@apache.org> on 2011/07/15 00:00:05 UTC

[jira] [Created] (NUTCH-1052) Multiple delete of the same URL using SolrClean

Multiple delete of the same URL using SolrClean
-----------------------------------------------

                 Key: NUTCH-1052
                 URL: https://issues.apache.org/jira/browse/NUTCH-1052
             Project: Nutch
          Issue Type: Improvement
          Components: indexer
    Affects Versions: 1.3, 1.4
            Reporter: Tim Pease
            Priority: Minor


The SolrClean class does not keep track of purged URLs; it only checks the URL status for "db_gone". When run multiple times, it will delete the same list of URLs from Solr again. For small, stable crawl databases this is not a problem, but for larger crawls it could be: SolrClean becomes an expensive operation.

One solution is to add a "purged" flag to the CrawlDatum metadata. SolrClean would then check this flag in addition to the "db_gone" status before adding the URL to the delete list.

Another solution is to add a new state, "db_gone_and_purged", to the status field.

Either way, the crawl DB will need to be updated after the Solr delete has successfully occurred.
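For illustration, a minimal sketch of the metadata-flag variant. The PurgedFlag helper and the "_purged_" key name are hypothetical, not part of Nutch; only the CrawlDatum calls are existing API.

{code}
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.nutch.crawl.CrawlDatum;

// Hypothetical helper illustrating the "purged" metadata flag idea.
public class PurgedFlag {

  private static final Text PURGED_KEY = new Text("_purged_"); // assumed key name

  // Select only URLs that are db_gone AND not yet deleted from Solr.
  public static boolean shouldDelete(CrawlDatum datum) {
    return datum.getStatus() == CrawlDatum.STATUS_DB_GONE
        && !datum.getMetaData().containsKey(PURGED_KEY);
  }

  // After the Solr delete succeeds, the crawl DB update sets the flag so the
  // next SolrClean run skips this URL.
  public static void markPurged(CrawlDatum datum) {
    datum.getMetaData().put(PURGED_KEY, new IntWritable(1));
  }
}
{code}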


[jira] [Commented] (NUTCH-1052) Multiple deletes of the same URL using SolrClean

Posted by "Markus Jelsma (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-1052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13108763#comment-13108763 ] 

Markus Jelsma commented on NUTCH-1052:
--------------------------------------

Thanks, I already did :) I now write the action as a single byte and use doc.write(out) to write the document itself. It works at compile time and when running locally, although I think the write and readFields methods are never called in local mode; at least I don't get a runtime error.
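
A minimal sketch of those Writable methods, based on the description above and the field layout shown later in this thread (the class lives in org.apache.nutch.indexer next to NutchDocument, so no import is needed for it):

{code}
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.Writable;

// Sketch: the action is serialized as a single byte, the document via
// NutchDocument's own write()/readFields().
public class NutchIndexAction implements Writable {

  public static final byte ADD = 0;
  public static final byte DELETE = 1;

  public NutchDocument doc = null;
  public byte action = 0;

  public void write(DataOutput out) throws IOException {
    out.writeByte(action);   // action first
    doc.write(out);          // then delegate document serialization
  }

  public void readFields(DataInput in) throws IOException {
    action = in.readByte();  // read back in the same order
    doc = new NutchDocument();
    doc.readFields(in);
  }
}
{code}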

I'll try running it on the cluster tomorrow or so.


[jira] [Commented] (NUTCH-1052) Multiple deletes of the same URL using SolrClean

Posted by "Markus Jelsma (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-1052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13097796#comment-13097796 ] 

Markus Jelsma commented on NUTCH-1052:
--------------------------------------

Perhaps an even better solution is to keep SolrClean as a tool for purging the whole index of all 404's in the CrawlDB, and to change the indexers to optionally send delete commands.
IndexerMapReduce currently skips non-FETCH_SUCCESS CrawlDatums. It can be modified to pass those records through. The only problem is how to get a delete flag into the SolrWriter; it takes only a NutchDocument object.

With this approach we make the SolrClean tool obsolete in regular crawl cycles, which makes a huge difference with large indexes and a large CrawlDB.
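
A rough sketch of that reduce()-side change, heavily simplified (the real IndexerMapReduce.reduce() also gathers fetch data, parse data and inlinks; the emitDeleteIfGone wrapper and the delete flag are hypothetical):

{code}
import java.io.IOException;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.nutch.crawl.CrawlDatum;
import org.apache.nutch.indexer.NutchDocument;

// Sketch: pass db_gone records through as a null document so the output
// format can turn them into delete requests instead of silently dropping
// everything that is not FETCH_SUCCESS.
public class DeletePassThroughSketch {

  public static boolean emitDeleteIfGone(Text key, CrawlDatum dbDatum,
      boolean delete, OutputCollector<Text, NutchDocument> output)
      throws IOException {
    if (dbDatum.getStatus() == CrawlDatum.STATUS_DB_GONE) {
      if (delete) {                // hypothetical switch, cf. the -delete flag below
        output.collect(key, null); // null value read as "delete this key"
      }
      return true;                 // record handled; nothing to index
    }
    return false;                  // continue with normal indexing
  }
}
{code}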

Comments?!


[jira] [Commented] (NUTCH-1052) Multiple deletes of the same URL using SolrClean

Posted by "Markus Jelsma (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-1052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13108731#comment-13108731 ] 

Markus Jelsma commented on NUTCH-1052:
--------------------------------------

I see. I did a quick modification and came up with this (ditched the enum and used static final byte instead):

{code}
package org.apache.nutch.indexer;

class NutchIndexAction {

  // Action encoded as a single byte instead of an enum.
  public static final byte ADD = 0;
  public static final byte DELETE = 1;

  // The document to act on and the action to apply to it.
  public NutchDocument doc = null;
  public byte action = 0;

  public NutchIndexAction(NutchDocument doc, byte action) {
    this.doc = doc;
    this.action = action;
  }
}
{code}

All references to NutchDocument in IndexerMapReduce and IndexerOutputFormat have been replaced with the new NutchIndexAction. It compiles and runs as expected locally, without implementing Writable. I also moved the config param from SolrConstants to IndexerMapReduce so that IndexerMapReduce doesn't rely on the indexing backend for its param.

Julien, will it break on Hadoop without implementing Writable? If, as you say, I have to implement it, can you give a small example? I assume I have to write and read the class' attributes in order.

Thanks again!



[jira] [Commented] (NUTCH-1052) Multiple deletes of the same URL using SolrClean

Posted by "Julien Nioche (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-1052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13108757#comment-13108757 ] 

Julien Nioche commented on NUTCH-1052:
--------------------------------------

{quote}
Julien, will it break on Hadoop without implementing Writable? As you say i have to implement it, can you give a small example? I assume i have to write and read the class' attributes in order.
{quote}

Look at NutchDocument itself - it is a nice example of a Writable object.


[jira] [Commented] (NUTCH-1052) Multiple deletes of the same URL using SolrClean

Posted by "Markus Jelsma (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-1052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13093746#comment-13093746 ] 

Markus Jelsma commented on NUTCH-1052:
--------------------------------------

Updating the CrawlDB is a tedious process, and I really would like to avoid updating the DB twice per cycle; it's too heavy in my opinion. I would prefer to do it segment-based (or batch-based, in Nutch 2.0 jargon), or using a timestamp parameter for more control.

Thoughts?


[jira] [Updated] (NUTCH-1052) Multiple deletes of the same URL using SolrClean

Posted by "Markus Jelsma (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-1052?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Markus Jelsma updated NUTCH-1052:
---------------------------------

    Attachment: NUTCH-1052-1.4-2.patch

Fixed a rarely occurring NPE. Please comment.


[jira] [Updated] (NUTCH-1052) Multiple deletes of the same URL using SolrClean

Posted by "Tim Pease (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-1052?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tim Pease updated NUTCH-1052:
-----------------------------

    Summary: Multiple deletes of the same URL using SolrClean  (was: Multiple delete of the same URL using SolrClean)


[jira] [Commented] (NUTCH-1052) Multiple deletes of the same URL using SolrClean

Posted by "Tim Pease (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-1052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13102846#comment-13102846 ] 

Tim Pease commented on NUTCH-1052:
----------------------------------

This patch looks like it will work. The delete method in the SolrWriter should either increment the commitSize counter, or a new counter should be created for deleted URLs.
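
For instance, something along these lines in SolrWriter (the numDeletes field is hypothetical; deleteById() and commit() are standard SolrJ calls):

{code}
import java.io.IOException;

import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.SolrServerException;

// Sketch of the suggested delete counter, flushed in batches the same way
// SolrWriter's commitSize threshold batches adds.
class DeleteCounterSketch {

  private final SolrServer solr;
  private final int commitSize;
  private int numDeletes = 0;

  DeleteCounterSketch(SolrServer solr, int commitSize) {
    this.solr = solr;
    this.commitSize = commitSize;
  }

  public void delete(String key) throws IOException {
    try {
      solr.deleteById(key);
      if (++numDeletes >= commitSize) { // commit deletes like buffered adds
        solr.commit();
        numDeletes = 0;
      }
    } catch (SolrServerException e) {
      throw new IOException(e.toString());
    }
  }
}
{code}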

Another thought. Should something similar be done for URLs that have changed into redirects? For example, a webmaster might decide to change their URL slugs. All the old URLs now become 301 redirects to the new URL locations. It would be nice to be able to purge the invalid URLs from Solr.

Thanks for all the work on this issue! My Hadoop skills are slowly increasing, and one day soon I'll be able to submit my own patches :)


[jira] [Closed] (NUTCH-1052) Multiple deletes of the same URL using SolrClean

Posted by "Markus Jelsma (Closed) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-1052?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Markus Jelsma closed NUTCH-1052.
--------------------------------


Bulk close of resolved issues of 1.4. bulkclose-1.4-20111220

[jira] [Updated] (NUTCH-1052) Multiple deletes of the same URL using SolrClean

Posted by "Markus Jelsma (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-1052?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Markus Jelsma updated NUTCH-1052:
---------------------------------

    Attachment: NUTCH-1052-1.4-3.patch

New patch that treats both docs and deletes in the same batch and issues delete commands for permanent redirects as well.
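
The redirect handling presumably boils down to a condition like this (a sketch of the idea, not the patch itself; both status constants exist on CrawlDatum):

{code}
import org.apache.nutch.crawl.CrawlDatum;

// Sketch: permanent redirects are treated like gone pages for deletion.
class PurgeCheck {
  static boolean shouldPurge(CrawlDatum dbDatum) {
    byte status = dbDatum.getStatus();
    return status == CrawlDatum.STATUS_DB_GONE
        || status == CrawlDatum.STATUS_DB_REDIR_PERM;
  }
}
{code}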


[jira] [Assigned] (NUTCH-1052) Multiple deletes of the same URL using SolrClean

Posted by "Markus Jelsma (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-1052?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Markus Jelsma reassigned NUTCH-1052:
------------------------------------

    Assignee: Markus Jelsma


[jira] [Updated] (NUTCH-1052) Multiple deletes of the same URL using SolrClean

Posted by "Markus Jelsma (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-1052?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Markus Jelsma updated NUTCH-1052:
---------------------------------

    Attachment: NUTCH-1052-1.4-1.patch

Here's a patch adding a -delete switch to the solrindex command. It changes IndexerMapReduce to output key,null for records with DB_GONE status.

The null NutchDocument is caught in IndexerOutputFormat, which then calls a writer.delete(key) method.

I am not sure if this is the correct approach, but it works nicely. I did add the delete method's signature to the NutchIndexWriter interface.
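
The RecordWriter behaviour then looks roughly like this (a simplified sketch; the real IndexerOutputFormat fans out to every configured NutchIndexWriter, and delete(String) is the signature this patch adds to that interface):

{code}
import java.io.IOException;

import org.apache.hadoop.io.Text;

// Sketch of the null-check described above; NutchDocument and
// NutchIndexWriter live in org.apache.nutch.indexer.
class IndexerRecordWriterSketch {

  private final NutchIndexWriter writer;

  IndexerRecordWriterSketch(NutchIndexWriter writer) {
    this.writer = writer;
  }

  public void write(Text key, NutchDocument doc) throws IOException {
    if (doc == null) {
      writer.delete(key.toString()); // DB_GONE record: issue a delete
    } else {
      writer.write(doc);             // normal add
    }
  }
}
{code}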

Please comment. If this is deemed appropriate, I'll change the issue's title to reflect this new approach.


[jira] [Commented] (NUTCH-1052) Multiple deletes of the same URL using SolrClean

Posted by "Julien Nioche (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-1052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13108701#comment-13108701 ] 

Julien Nioche commented on NUTCH-1052:
--------------------------------------

Yep, that's the idea.

The class will have to be Writable and should live in the same place as NutchDocument, i.e. org.apache.nutch.indexer.

[jira] [Commented] (NUTCH-1052) Multiple deletes of the same URL using SolrClean

Posted by "Markus Jelsma (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-1052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13103018#comment-13103018 ] 

Markus Jelsma commented on NUTCH-1052:
--------------------------------------

Although a delete doesn't take much space in the buffer, there is a potential for thousands of deletes stacking up; deletes should indeed increment the counter.

Redirects (permanent and temporary moves) are another problem. During indexing we don't know if a URL has become a redirect. The only solution would be to treat them the same as db_gone. This can lead to a significant number of useless deletes, but the same is true for db_gone anyway. Solr, at least, doesn't waste too many cycles on useless delete actions.

I do need another committer's comments on the abuse of the RecordWriter. It works all right but doesn't feel right. A possible solution would be to use a small struct that holds the document and the index/delete flag. It is not possible to pass more parameters than the key/value pair.


[jira] [Updated] (NUTCH-1052) Multiple deletes of the same URL using SolrClean

Posted by "Markus Jelsma (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-1052?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Markus Jelsma updated NUTCH-1052:
---------------------------------

      Priority: Major  (was: Minor)
    Patch Info: [Patch Available]
      Assignee: Julien Nioche  (was: Markus Jelsma)

Assigned to Julien for review.


[jira] [Issue Comment Edited] (NUTCH-1052) Multiple deletes of the same URL using SolrClean

Posted by "Markus Jelsma (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-1052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13102650#comment-13102650 ] 

Markus Jelsma edited comment on NUTCH-1052 at 9/12/11 1:45 PM:
---------------------------------------------------------------

Fixed a rarely occurring NPE. Please comment.

PS: commented-out code and debug output will, of course, be removed prior to commit.

      was (Author: markus17):
    Fixed a rarely occurring NPE. Please comment.

[jira] [Resolved] (NUTCH-1052) Multiple deletes of the same URL using SolrClean

Posted by "Markus Jelsma (Resolved) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-1052?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Markus Jelsma resolved NUTCH-1052.
----------------------------------

    Resolution: Won't Fix

Resolved as Won't Fix. SolrClean won't be modified for this reason and will still exist as a maintenance tool to delete 404's without segment information. This issue is superseded by NUTCH-1139.

[jira] [Commented] (NUTCH-1052) Multiple deletes of the same URL using SolrClean

Posted by "Markus Jelsma (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-1052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13108641#comment-13108641 ] 

Markus Jelsma commented on NUTCH-1052:
--------------------------------------

Thanks for your comments! Just to make sure I understand you correctly: you agree we should use a container object that holds the NutchDocument and the action (ADD || DELETE) and pass it to RecordWriter.write(), instead of abusing NULL as an encoded delete action as I do now?

Then I'd need to add a class somewhere, such as:

{code}
class NutchIndexAction {
  public enum Action { ADD, DELETE }

  public NutchDocument doc;
  public Action action;
}
{code}

And pass a NutchIndexAction instance to the record writer from IndexerMapReduce? If so, what would be the appropriate location for such a class?


[jira] [Updated] (NUTCH-1052) Multiple deletes of the same URL using SolrClean

Posted by "Markus Jelsma (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/NUTCH-1052?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Markus Jelsma updated NUTCH-1052:
---------------------------------

    Fix Version/s: 2.0
                   1.4


[jira] [Commented] (NUTCH-1052) Multiple deletes of the same URL using SolrClean

Posted by "Julien Nioche (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/NUTCH-1052?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13108633#comment-13108633 ] 

Julien Nioche commented on NUTCH-1052:
--------------------------------------

I like the original idea and agree that having to read/write the whole crawldb once more would be a pain for large crawls. This is a good example of what 2.0 could add (or could have added if you are pessimistic). 

I agree with your suggestion for an alternative to using null as the value: encode the action (add, delete) either as a complex object in the key or as part of the value. The latter makes more sense, as it is unlikely that we'd add AND delete the same document in the same batch. Could you include that in your patch?
