Posted to dev@nutch.apache.org by "ASF GitHub Bot (Jira)" <ji...@apache.org> on 2023/11/17 19:49:00 UTC

[jira] [Commented] (NUTCH-3026) Allow statusOnly option for IndexingJob

    [ https://issues.apache.org/jira/browse/NUTCH-3026?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17787371#comment-17787371 ] 

ASF GitHub Bot commented on NUTCH-3026:
---------------------------------------

tballison opened a new pull request, #799:
URL: https://github.com/apache/nutch/pull/799

   Thanks for your contribution to [Apache Nutch](https://nutch.apache.org/)! Your help is appreciated!
   
   Before opening the pull request, please verify that
   * there is an open issue on the [Nutch issue tracker](https://issues.apache.org/jira/projects/NUTCH) which describes the problem or the improvement. We cannot accept pull requests without an issue because the change wouldn't be listed in the release notes.
   * the issue ID (`NUTCH-XXXX`)
     - is referenced in the title of the pull request
     - and placed in front of your commit messages surrounded by square brackets (`[NUTCH-XXXX] Issue or pull request title`)
   * commits are squashed into a single one (or few commits for larger changes)
   * Java source code follows [Nutch Eclipse Code Formatting rules](https://github.com/apache/nutch/blob/master/eclipse-codeformat.xml)
   * Nutch is successfully built and unit tests pass by running `ant clean runtime test`
   * there should be no conflicts when merging the pull request branch into the *recent* master branch. If there are conflicts, please try to rebase the pull request branch on top of a freshly pulled master branch.
   * if new dependencies are added,
     - are these dependencies licensed in a way that is compatible for inclusion under [ASF 2.0](https://www.apache.org/legal/resolved.html#category-a)?
     - are `LICENSE-binary` and `NOTICE-binary` updated accordingly?
   
   We will be able to integrate your pull request faster if these conditions are met. If you have any questions about how to fix your problem or about using Nutch in general, please sign up for the [Nutch mailing list](https://nutch.apache.org/mailing_lists.html). Thanks!
   




> Allow statusOnly option for IndexingJob
> ---------------------------------------
>
>                 Key: NUTCH-3026
>                 URL: https://issues.apache.org/jira/browse/NUTCH-3026
>             Project: Nutch
>          Issue Type: New Feature
>            Reporter: Tim Allison
>            Priority: Major
>
> This issue follows on from discussion here: https://lists.apache.org/thread/mrnsc4hq0h4wovgdppfmrbyo2pxmjpyy
> I'd like to be able to run aggregations and other analytics on the current status of a given crawl outside of Hadoop.
> There are different ways of going about this, and the title of this ticket leads with my preference, but I'm opening this ticket for discussion.
> The goal would be to have an index with per-URL information on fetch status, HTTP status, parse status, and possibly user-selected parse metadata when it exists.
> I want to be able to count 404s and other fetch issues (by host). I want to be able to count parse exceptions, file types (by host), etc.
> I do not want to pollute my search index with content-less documents for 404s/parse exceptions etc. I want two indices: one for crawl status and one for search.
> Here are some options I see:
> Option 1: add a "statusOnly" option to the IndexingJob. This would intentionally skip much of the current logic that says "only send to the index if there was a fetch success and there was a parse success and it isn't a duplicate and ...". My proposal would not delete statuses in this index; rather, the working assumption, at least to start, is that you'd run this on an empty index to get a snapshot of the latest crawl data. We can look into changing this in the future, but not on this ticket.
> Option 2: copy/paste IndexingJob, modify it, and treat it as a separate tool.
> Option 3: modify readdb or readseg to do roughly this, but it feels like neither one touches enough of the data components.
> Option 4: I can effectively do option 2 in a personal repo and not add more code to Nutch.
> Other options?
> And, importantly, is there anyone else who would use this? Or is this really only something that I'd want?
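
For illustration only, here is a minimal Java sketch of what Option 1 and the "count 404s by host" goal might look like. It does not use or reproduce any Nutch API: `StatusOnlySketch`, `CrawlRecord`, `shouldIndex`, `toStatusDoc`, `countByHost`, the field names, and the sample URLs are all hypothetical and invented for this example. The only assumption taken from the issue text is that the normal path indexes a record only after a successful fetch, a successful parse, and a duplicate check, while a status-only run would keep every record.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical sketch; none of these names are Nutch APIs. */
public class StatusOnlySketch {

  /** Hypothetical per-URL record combining fetch and parse state. */
  static final class CrawlRecord {
    final String url;
    final boolean fetchSuccess;
    final int httpStatus;
    final boolean parseSuccess;
    final boolean duplicate;

    CrawlRecord(String url, boolean fetchSuccess, int httpStatus,
        boolean parseSuccess, boolean duplicate) {
      this.url = url;
      this.fetchSuccess = fetchSuccess;
      this.httpStatus = httpStatus;
      this.parseSuccess = parseSuccess;
      this.duplicate = duplicate;
    }
  }

  /** Decides whether a record is sent to the index. */
  static boolean shouldIndex(CrawlRecord r, boolean statusOnly) {
    if (statusOnly) {
      return true; // status index: keep every record, including failures
    }
    // search index: the usual success-only conditions described in the issue
    return r.fetchSuccess && r.parseSuccess && !r.duplicate;
  }

  /** Builds the fields of a content-less status document. */
  static Map<String, String> toStatusDoc(CrawlRecord r) {
    Map<String, String> doc = new TreeMap<>();
    doc.put("url", r.url);
    doc.put("host", URI.create(r.url).getHost());
    doc.put("fetchSuccess", Boolean.toString(r.fetchSuccess));
    doc.put("httpStatus", Integer.toString(r.httpStatus));
    doc.put("parseSuccess", Boolean.toString(r.parseSuccess));
    return doc;
  }

  /** Counts status documents with the given HTTP status, grouped by host. */
  static Map<String, Integer> countByHost(List<Map<String, String>> docs,
      int httpStatus) {
    Map<String, Integer> counts = new TreeMap<>();
    String wanted = Integer.toString(httpStatus);
    for (Map<String, String> doc : docs) {
      if (wanted.equals(doc.get("httpStatus"))) {
        counts.merge(doc.get("host"), 1, Integer::sum);
      }
    }
    return counts;
  }

  /** Made-up sample data for the demonstration. */
  static List<CrawlRecord> sample() {
    List<CrawlRecord> records = new ArrayList<>();
    records.add(new CrawlRecord("https://example.org/ok", true, 200, true, false));
    records.add(new CrawlRecord("https://example.org/missing", false, 404, false, false));
    records.add(new CrawlRecord("https://example.org/missing2", false, 404, false, false));
    records.add(new CrawlRecord("https://example.com/gone", false, 404, false, false));
    records.add(new CrawlRecord("https://example.com/broken", true, 200, false, false));
    return records;
  }

  public static void main(String[] args) {
    List<Map<String, String>> statusDocs = new ArrayList<>();
    int searchDocs = 0;
    for (CrawlRecord r : sample()) {
      if (shouldIndex(r, true)) {
        statusDocs.add(toStatusDoc(r));
      }
      if (shouldIndex(r, false)) {
        searchDocs++;
      }
    }
    System.out.println("status docs: " + statusDocs.size());
    System.out.println("search docs: " + searchDocs);
    System.out.println("404s by host: " + countByHost(statusDocs, 404));
  }
}
```

In this sketch the same records feed two separate targets: the status-only path keeps failures so they can be aggregated, and the success-only path keeps the search index free of content-less documents. In practice the aggregation would more likely run as a query against the status index than in application code.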



--
This message was sent by Atlassian Jira
(v8.20.10#820010)