Posted to dev@nutch.apache.org by "Doug Cook (JIRA)" <ji...@apache.org> on 2006/09/08 20:47:22 UTC

[jira] Created: (NUTCH-363) Fetcher normalizes everything at least twice

Fetcher normalizes everything at least twice
--------------------------------------------

                 Key: NUTCH-363
                 URL: http://issues.apache.org/jira/browse/NUTCH-363
             Project: Nutch
          Issue Type: Bug
          Components: fetcher
    Affects Versions: 0.8
         Environment: OS X 10.4.7
            Reporter: Doug Cook
            Priority: Minor


New links are normalized twice by the fetcher: 

First in DOMContentUtils.getOutlinks, where the constructor Outlink(url.toString(), linkText.toString().trim(), conf) normalizes the URL.

The second time is in ParseOutputFormat.write().

A URL that is repeated on a page may be normalized more often than that (once per occurrence in each pass), but every URL is normalized at least twice.

For those of us with expensive normalizations, this is probably burning some CPU. 
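
To illustrate the cost, here is a minimal sketch of one possible mitigation: a wrapper that remembers URLs it has already normalized, so a second pass over the same URL is a map lookup. CachingNormalizer and getDelegateCalls are hypothetical names, not Nutch classes. The sketch assumes the underlying normalizer is idempotent (normalizing an already-normalized URL returns it unchanged) and that an unbounded in-memory map is acceptable.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.UnaryOperator;

/** Hypothetical wrapper (not part of Nutch): caches results of an expensive normalizer. */
final class CachingNormalizer implements UnaryOperator<String> {
    private final UnaryOperator<String> delegate;
    private final Map<String, String> cache = new HashMap<>();
    private int delegateCalls = 0;

    CachingNormalizer(UnaryOperator<String> delegate) {
        this.delegate = delegate;
    }

    @Override
    public String apply(String url) {
        String cached = cache.get(url);
        if (cached != null) {
            return cached; // seen before: skip the expensive call
        }
        delegateCalls++;
        String normalized = delegate.apply(url);
        cache.put(url, normalized);
        // Assumption: the delegate is idempotent, so the normalized form maps to itself.
        // This makes a second normalization pass over the output a cache hit.
        cache.put(normalized, normalized);
        return normalized;
    }

    /** Number of times the underlying (expensive) normalizer actually ran. */
    int getDelegateCalls() {
        return delegateCalls;
    }
}
```

With a wrapper like this, normalizing a URL in the parser and again at write time would invoke the expensive normalizer only once, provided both steps share the same instance. Whether they can share one in the fetcher is exactly the kind of hidden assumption that would need checking.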

I'd gladly fix this, but I'm not yet familiar enough with the code to know if there are some hidden assumptions which rely on this behavior.

[A related note: URLs are normalized *before* filtering, which causes a lot of extra normalization as well, since URLs that will be filtered out anyway are normalized first. In general, filters may not be safe to run before normalization, but there is likely a class of them which are (e.g. filtering out .gif/.jpg etc.). Perhaps the notion of a "pre-normalizer filter" would be a useful one?]
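
A rough sketch of what such a "pre-normalizer filter" could look like follows. PreFilterPipeline, passesPreFilter and process are hypothetical names, not existing Nutch APIs. It assumes a suffix check is safe to run on raw URLs, i.e. that the normalizer would never turn a URL ending in a skipped suffix into one that should be kept.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.function.Predicate;
import java.util.function.UnaryOperator;

/** Hypothetical pipeline (not part of Nutch): cheap filter, then normalize, then full filter. */
final class PreFilterPipeline {
    // Suffixes assumed safe to reject before normalization.
    private static final String[] SKIPPED_SUFFIXES = {".gif", ".jpg"};

    /** Cheap check on the raw URL; returns false for URLs that can be dropped early. */
    static boolean passesPreFilter(String url) {
        String lower = url.toLowerCase(Locale.ROOT);
        for (String suffix : SKIPPED_SUFFIXES) {
            if (lower.endsWith(suffix)) {
                return false;
            }
        }
        return true;
    }

    /** Normalizes only the URLs that survive the pre-filter, then applies the regular filter. */
    static List<String> process(List<String> urls,
                                UnaryOperator<String> normalizer,
                                Predicate<String> postFilter) {
        List<String> kept = new ArrayList<>();
        for (String url : urls) {
            if (!passesPreFilter(url)) {
                continue; // rejected without paying for normalization
            }
            String normalized = normalizer.apply(url);
            if (postFilter.test(normalized)) {
                kept.add(normalized);
            }
        }
        return kept;
    }
}
```

The pre-filter is deliberately conservative: a URL such as one with a query string after the image name does not end in a skipped suffix, so it simply falls through to normalization and the regular filter.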

