Posted to dev@nutch.apache.org by "Doug Cutting (JIRA)" <ji...@apache.org> on 2006/03/20 19:42:59 UTC

[jira] Commented: (NUTCH-235) Duplicate Inlink values

    [ http://issues.apache.org/jira/browse/NUTCH-235?page=comments#action_12371122 ] 

Doug Cutting commented on NUTCH-235:
------------------------------------

I'm concerned about all of the contains() calls this adds to an ArrayList.  Each call is a linear scan, which makes the cost of building a set of links quadratic in the number of links.  If we're making this change, shouldn't we change the underlying set implementation too?  A HashSet would probably work well here.
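
To illustrate the cost concern, here is a minimal sketch that contrasts the two ways of dropping duplicates. It is not the Nutch code: the class DedupSketch and its two methods are hypothetical names, and plain strings stand in for link values.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical demo class; not part of Nutch.
class DedupSketch {

    // List-backed approach: every contains() call scans the list built so far,
    // so adding n distinct values costs on the order of n^2 comparisons.
    static List<String> dedupWithList(List<String> input) {
        List<String> out = new ArrayList<>();
        for (String s : input) {
            if (!out.contains(s)) {
                out.add(s);
            }
        }
        return out;
    }

    // Set-backed approach: HashSet.add() is expected constant time given a
    // reasonable hashCode(), so the whole pass is expected linear in n.
    static Set<String> dedupWithSet(List<String> input) {
        return new HashSet<>(input);
    }

    public static void main(String[] args) {
        List<String> urls = Arrays.asList("http://a/", "http://b/", "http://a/");
        // Both approaches keep the same two distinct values.
        System.out.println(dedupWithList(urls).size());
        System.out.println(dedupWithSet(urls).size());
    }
}
```

Both methods yield the same distinct values; the difference only shows up in running time as the number of links grows. Note that a HashSet does not preserve insertion order, which may or may not matter for the callers.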

> Duplicate Inlink values
> -----------------------
>
>          Key: NUTCH-235
>          URL: http://issues.apache.org/jira/browse/NUTCH-235
>      Project: Nutch
>         Type: Bug
>     Versions: 0.8-dev
>     Reporter: Andrzej Bialecki 
>     Assignee: Andrzej Bialecki 
>  Attachments: patch.txt
>
> Reading the code for LinkDb.reduce(): if we have page duplicates in input segments, or if we have two copies of the same input segment, we will create the same Inlink values (satisfying Inlink.equals()) multiple times. Since Inlinks is a facade for a List, and not a Set, we will get duplicate Inlink-s in Inlinks (if you know what I mean ;) ).
> The problem is easy to test: create a new linkdb based on 2 identical segments. This problem also makes it more difficult to properly implement a LinkDb updating mechanism (i.e. incremental invertlinks).
> I propose to change Inlinks to use Set semantics, either explicitly by using a HashSet or implicitly by checking whether a value to be added already exists. If there are no objections I'll commit this change shortly.
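
The explicit-Set option proposed above could look roughly like the following sketch. All class names here (InlinkSketch, InlinksSketch, InlinksDemo) are hypothetical stand-ins, and the two fields are an assumption about what identifies a link; the real Inlink and Inlinks classes may differ. The point it shows is that a HashSet relies on equals() and hashCode() being consistent, and that feeding the same segment twice then leaves no duplicates.

```java
import java.util.HashSet;
import java.util.Objects;
import java.util.Set;

// Hypothetical stand-in for an inlink value; the field names are assumptions.
final class InlinkSketch {
    private final String fromUrl;
    private final String anchor;

    InlinkSketch(String fromUrl, String anchor) {
        this.fromUrl = fromUrl;
        this.anchor = anchor;
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof InlinkSketch)) return false;
        InlinkSketch other = (InlinkSketch) o;
        return fromUrl.equals(other.fromUrl) && anchor.equals(other.anchor);
    }

    // Must agree with equals(), otherwise the HashSet cannot detect duplicates.
    @Override
    public int hashCode() {
        return Objects.hash(fromUrl, anchor);
    }
}

// Hypothetical set-backed container: add() silently ignores equal values.
final class InlinksSketch {
    private final Set<InlinkSketch> inlinks = new HashSet<>();

    boolean add(InlinkSketch inlink) {
        return inlinks.add(inlink); // false if an equal value is already present
    }

    int size() {
        return inlinks.size();
    }
}

class InlinksDemo {
    // Simulates building a linkdb entry from two identical segments.
    static int countAfterDuplicateSegments() {
        InlinksSketch inlinks = new InlinksSketch();
        for (int segment = 0; segment < 2; segment++) {
            inlinks.add(new InlinkSketch("http://example.com/a", "home"));
            inlinks.add(new InlinkSketch("http://example.com/b", "about"));
        }
        return inlinks.size();
    }
}
```

With a List-backed container the same loop would store four entries; with the set-backed sketch only the two distinct values remain.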
