Posted to dev@nutch.apache.org by "Tim Allison (Jira)" <ji...@apache.org> on 2023/11/17 16:55:00 UTC

[jira] [Created] (NUTCH-3026) Allow statusOnly option for IndexingJob

Tim Allison created NUTCH-3026:
----------------------------------

             Summary: Allow statusOnly option for IndexingJob
                 Key: NUTCH-3026
                 URL: https://issues.apache.org/jira/browse/NUTCH-3026
             Project: Nutch
          Issue Type: Task
            Reporter: Tim Allison


This issue follows on from discussion here: https://lists.apache.org/thread/mrnsc4hq0h4wovgdppfmrbyo2pxmjpyy

I'd like to be able to run aggregations and other analytics on the current status of a given crawl outside of Hadoop.

There are different ways of going about this, and the title of this ticket leads with my preference, but I'm opening this ticket for discussion.

The goal would be to have an index with per-URL information on fetch status, HTTP status, parse status, and possibly user-selected parse metadata when it exists.

I want to be able to count 404s and other fetch issues (by host). I want to be able to count parse exceptions, file types (by host), etc.
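To make the kind of analytics concrete, here is a minimal sketch of per-host counting over status records, assuming they have been exported from such an index as simple dictionaries. The field names (url, http_status, parse_status), the sample values, and the count_by_host helper are all hypothetical illustrations, not an existing Nutch schema or API.

```python
from collections import Counter
from urllib.parse import urlsplit

# Hypothetical status records, one per URL. The field names and values are
# assumptions for illustration, not the schema of any existing Nutch index.
records = [
    {"url": "https://example.com/a", "http_status": 200, "parse_status": "success"},
    {"url": "https://example.com/missing", "http_status": 404, "parse_status": None},
    {"url": "https://example.org/b.pdf", "http_status": 200, "parse_status": "failed"},
    {"url": "https://example.org/gone", "http_status": 404, "parse_status": None},
    {"url": "https://example.org/gone2", "http_status": 404, "parse_status": None},
]


def count_by_host(records, predicate):
    """Count the records that satisfy predicate, grouped by URL host."""
    counts = Counter()
    for rec in records:
        if predicate(rec):
            counts[urlsplit(rec["url"]).hostname] += 1
    return counts


# 404s per host, and parse failures per host.
not_found = count_by_host(records, lambda r: r["http_status"] == 404)
parse_failures = count_by_host(records, lambda r: r["parse_status"] == "failed")
```

In practice a search backend's own aggregation features would likely do this work; the sketch only shows the shape of the per-host questions.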

I do not want to pollute my search index with content-less documents for 404s/parse exceptions etc. I want a separate index.

Here are some options I see:

Option 1: Add a "statusOnly" option to the IndexingJob. This would intentionally skip much of the current logic that says "only send to the index if there was a fetch success and there was a parse success and it isn't a duplicate and ...". My proposal would not delete statuses in this index. Rather, the working assumption, at least to start, is that you'd run this on an empty index to get a snapshot of the latest crawl data. We can look into changing this in the future, but not on this ticket.

Option 2: Copy/paste IndexingJob, modify it, and call it a separate tool.

Option 3: Modify readdb or readseg to do roughly this, but it feels like neither one touches enough of the data components.

Option 4: I can effectively do Option 2 in a personal repo and not add more code to Nutch.
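As a rough illustration of what Option 1 would change, the following sketch models the "should this record be sent to the index" decision with and without a status-only flag. It is written in Python purely for brevity and is not Nutch code. The UrlState class, its fields, and should_index are hypothetical, and only the three conditions named in Option 1 are modeled; the real IndexingJob applies further checks.

```python
from dataclasses import dataclass


@dataclass
class UrlState:
    """Hypothetical per-URL summary; not a Nutch class."""
    fetch_ok: bool
    parse_ok: bool
    duplicate: bool


def should_index(state: UrlState, status_only: bool = False) -> bool:
    """Decide whether a record would be sent to the index.

    With status_only set, every record is sent so that failures remain
    visible in the separate status index. Otherwise only successfully
    fetched, successfully parsed, non-duplicate records pass.
    """
    if status_only:
        return True
    return state.fetch_ok and state.parse_ok and not state.duplicate


failed = UrlState(fetch_ok=False, parse_ok=False, duplicate=False)
print(should_index(failed))                    # normal indexing skips it
print(should_index(failed, status_only=True))  # status-only indexing keeps it
```

The point of the flag is that a 404 or a parse exception becomes a record in the status index rather than something silently filtered out.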

Other options?



--
This message was sent by Atlassian Jira
(v8.20.10#820010)