Posted to dev@nutch.apache.org by "Tejas Patil (JIRA)" <ji...@apache.org> on 2013/05/03 22:16:15 UTC

[jira] [Commented] (NUTCH-649) Log list of files found but not crawled.

    [ https://issues.apache.org/jira/browse/NUTCH-649?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13648752#comment-13648752 ] 

Tejas Patil commented on NUTCH-649:
-----------------------------------

Hi [~lewismc],
The method where I need to introduce the counters is not a map/reduce method; it is invoked while records are being written (it lives in the ParseOutputFormat class). AFAIK, Hadoop counters can only be used from within a map/reduce method. A quick Google search turned up [this |http://stackoverflow.com/questions/12645652/how-to-increment-a-hadoop-counter-from-outside-a-mapper-or-reducer], which indicates the same. Do you know how to do that? If not, should we go ahead with the current patch?
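
For what it's worth, the workaround usually cited for the classic org.apache.hadoop.mapred API (and described in the Stack Overflow thread linked above) is that the Progressable handed to getRecordWriter() is, at runtime, the task's Reporter, so it can be cast and used to increment counters from inside the output format. A rough, untested sketch along those lines (class and counter names here are hypothetical, not the actual ParseOutputFormat code):

```java
import java.io.IOException;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RecordWriter;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.util.Progressable;

// Sketch only: relies on the classic mapred runtime passing the task's
// Reporter as the Progressable argument to getRecordWriter().
public class CountingOutputFormat extends FileOutputFormat<Text, Text> {

  @Override
  public RecordWriter<Text, Text> getRecordWriter(FileSystem fs, JobConf job,
      String name, Progressable progress) throws IOException {
    // Fall back to the no-op reporter if the cast does not hold.
    final Reporter reporter =
        (progress instanceof Reporter) ? (Reporter) progress : Reporter.NULL;
    return new RecordWriter<Text, Text>() {
      public void write(Text key, Text value) throws IOException {
        // "ParserStatus" / "FilteredOutlinks" are hypothetical names.
        reporter.incrCounter("ParserStatus", "FilteredOutlinks", 1);
      }
      public void close(Reporter r) throws IOException {
      }
    };
  }
}
```

Whether that cast is safe across all runtimes (local vs. distributed) is exactly the open question, so the log-based approach in the current patch may still be the more portable choice.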
                
> Log list of files found but not crawled.
> ----------------------------------------
>
>                 Key: NUTCH-649
>                 URL: https://issues.apache.org/jira/browse/NUTCH-649
>             Project: Nutch
>          Issue Type: Improvement
>          Components: fetcher
>         Environment: any
>            Reporter: Jim
>             Fix For: 1.7, 2.2
>
>         Attachments: NUTCH-649.2.x.patch, NUTCH-649.trunk.patch
>
>
>         I use Nutch to find the locations of executables on the web, but we do not download the executables with Nutch. To get Nutch to report the locations of files without downloading them, I had to make a very small patch to the code, and I think this change might be useful to others as well. The patch just logs files that are being filtered, at the info level, although perhaps it should be at the debug level.
>         I have included an svn diff with this change. Use cases would be both as a diagnostic tool (let's see what we are skipping) and as a way to find content and links pointed to by a page or site without having to actually download that content.
> Index: ParseOutputFormat.java
> ===================================================================
> --- ParseOutputFormat.java      (revision 593619)
> +++ ParseOutputFormat.java      (working copy)
> @@ -193,17 +193,20 @@
>                 toHost = null;
>               }
>               if (toHost == null || !toHost.equals(fromHost)) { // external links
> +               LOG.info("filtering externalLink " + toUrl + " linked to by " + fromUrl);
> +
>                 continue; // skip it
>               }
>             }
>             try {
>               toUrl = normalizers.normalize(toUrl,
>                           URLNormalizers.SCOPE_OUTLINK); // normalize the url
> -              toUrl = filters.filter(toUrl);   // filter the url
> -              if (toUrl == null) {
> -                continue;
> -              }
> -            } catch (Exception e) {
> +
> +              if (filters.filter(toUrl) == null) {   // filter the url
> +                LOG.info("filtering content " + toUrl + " linked to by " + fromUrl);
> +                continue;
> +              }
> +            } catch (Exception e) {
>               continue;
>             }
>             CrawlDatum target = new CrawlDatum(CrawlDatum.STATUS_LINKED, interval);

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira