Posted to dev@nutch.apache.org by "Alex McLintock (JIRA)" <ji...@apache.org> on 2010/06/25 14:24:49 UTC

[jira] Commented: (NUTCH-363) Fetcher normalizes everything at least twice

    [ https://issues.apache.org/jira/browse/NUTCH-363?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12882544#action_12882544 ] 

Alex McLintock commented on NUTCH-363:
--------------------------------------

So this issue can be closed, right? Any objections?

> Fetcher normalizes everything at least twice
> --------------------------------------------
>
>                 Key: NUTCH-363
>                 URL: https://issues.apache.org/jira/browse/NUTCH-363
>             Project: Nutch
>          Issue Type: Bug
>          Components: fetcher
>    Affects Versions: 0.8
>         Environment: OS X 10.4.7
>            Reporter: Doug Cook
>            Priority: Minor
>
> New links are normalized twice by the fetcher: 
> First in DOMContentUtils.getOutlinks, where the constructor Outlink(url.toString(), linkText.toString().trim(), conf) normalizes the URL.
> The second time is in ParseOutputFormat.write().
> For some URLs (e.g. those repeated on a page) a given URL may be normalized a number of times, but it is always normalized at least twice.
> For those of us with expensive normalizations, this is probably burning some CPU. 
> I'd gladly fix this, but I'm not yet familiar enough with the code to know if there are some hidden assumptions which rely on this behavior.
> [A related note: URLs are normalized *before* filtering, which causes a lot of extra normalization as well. In general, filters may not be safe to run before normalization, but there is likely a class of them that is (e.g. filters that drop .gif/.jpg URLs). Perhaps the notion of a "pre-normalizer filter" would be a useful one?]
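
The two ideas in the description above (normalize each URL once, and run cheap filters before the expensive normalizer) can be sketched roughly as follows. This is only an illustration, not Nutch code: the class PreNormalizerFilterSketch and the methods passesCheapFilter and process are hypothetical names, the normalizer is passed in as a plain function rather than a Nutch normalizer, and the suffix check assumes that dropping .gif/.jpg URLs is safe before normalization.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.function.UnaryOperator;

// Hypothetical sketch; none of these names exist in Nutch.
class PreNormalizerFilterSketch {

    // A "pre-normalizer filter": a cheap check assumed to be safe on raw URLs.
    // It rejects URLs whose path ends in .gif or .jpg, ignoring query and fragment.
    static boolean passesCheapFilter(String url) {
        String lower = url.toLowerCase(Locale.ROOT);
        int cut = lower.length();
        int q = lower.indexOf('?');
        if (q >= 0) cut = Math.min(cut, q);
        int h = lower.indexOf('#');
        if (h >= 0) cut = Math.min(cut, h);
        String path = lower.substring(0, cut);
        return !(path.endsWith(".gif") || path.endsWith(".jpg"));
    }

    // Filters first, then normalizes each distinct raw URL at most once.
    // The cache lives only for one call, i.e. for the outlinks of one page.
    static List<String> process(List<String> rawUrls, UnaryOperator<String> normalizer) {
        Map<String, String> cache = new HashMap<>();
        List<String> result = new ArrayList<>();
        for (String raw : rawUrls) {
            if (!passesCheapFilter(raw)) {
                continue; // rejected before paying for normalization
            }
            String normalized = cache.get(raw);
            if (normalized == null) {
                normalized = normalizer.apply(raw); // the expensive step
                cache.put(raw, normalized);
            }
            result.add(normalized);
        }
        return result;
    }
}
```

With a page that repeats the same link, the normalizer passed to process runs once for that link, and image links never reach it. Whether a particular filter really is safe to run on un-normalized URLs would still have to be decided per filter.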

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.