Posted to dev@nutch.apache.org by Jay Yu <jy...@looksmart.net> on 2005/04/01 01:26:02 UTC
RE: [jira] Commented: (NUTCH-7) analyze tool takes up all the disk space when there are circular links

Is your change to the update db tool going to be in the next release?
Have you tested it?

Thanks for the fix!


-----Original Message-----
From: Phoebe Miller (JIRA) [mailto:jira@apache.org]
Sent: Thursday, March 31, 2005 8:59 AM
To: nutch-dev@incubator.apache.org
Subject: [jira] Commented: (NUTCH-7) analyze tool takes up all the disk
space when there are circular links


     [ http://issues.apache.org/jira/browse/NUTCH-7?page=comments#action_61899 ]
Phoebe Miller commented on NUTCH-7:
-----------------------------------

I have fixed this problem by changing the update database tool. Basically,
links from a page are not added if the page has already been processed and
is current (same MD5). Now link analysis won't run into these infinite chains
of links.

Here is the diff in UpdateDatabaseTool.java.


64d63
<     private IWebDBReader webdbread;
72c71
<     public UpdateDatabaseTool(IWebDBWriter webdb,  IWebDBReader webdbread, boolean additionsAllowed, int maxCount) {
---
>     public UpdateDatabaseTool(IWebDBWriter webdb, boolean additionsAllowed, int maxCount) {
74d72
<         this.webdbread = webdbread;
229,231d226
< // If the page is already in the db, so are the links,
< // This should take care of relative links and symlinks to itself.
<                                 if (!webdbread.pageExists(newPage.getMD5()))  // page not seen before
365,366c360
<       IWebDBReader webdbread = new WebDBReader(nfs, root);
<       UpdateDatabaseTool tool = new UpdateDatabaseTool(webdb, webdbread, additionsAllowed, max);
---
>       UpdateDatabaseTool tool = new UpdateDatabaseTool(webdb, additionsAllowed, max);
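
The idea behind the change can be sketched in isolation. This is a minimal
sketch, not the Nutch code: SeenContentFilter, pageFetched and the in-memory
set are hypothetical stand-ins for UpdateDatabaseTool and the
webdbread.pageExists(newPage.getMD5()) lookup, and it assumes that two pages
with the same MD5 of their content would contribute the same links.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical stand-in for the update step; not the Nutch API.
public class SeenContentFilter {
    // MD5 digests of page contents that have already been processed.
    private final Set<String> seenDigests = new HashSet<>();
    // Links accepted so far (stands in for the links written to the web db).
    private final List<String> links = new ArrayList<>();

    // Hex-encoded MD5 of the page content.
    public static String md5Hex(String content) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            byte[] digest = md.digest(content.getBytes(StandardCharsets.UTF_8));
            StringBuilder sb = new StringBuilder();
            for (byte b : digest) {
                sb.append(String.format("%02x", b));
            }
            return sb.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    // Records the outlinks only if this content has not been seen before.
    // Returns true if the links were added, false if they were skipped.
    public boolean pageFetched(String content, List<String> outlinks) {
        if (!seenDigests.add(md5Hex(content))) {
            return false; // same MD5 already processed: do not add its links again
        }
        links.addAll(outlinks);
        return true;
    }

    public int linkCount() {
        return links.size();
    }

    public static void main(String[] args) {
        SeenContentFilter filter = new SeenContentFilter();
        filter.pageFetched("<html>forms</html>", List.of("grants/process.htm"));
        // Same content reached under a longer URL: its links are skipped.
        filter.pageFetched("<html>forms</html>", List.of("grants/process.htm"));
        System.out.println(filter.linkCount());
    }
}
```

With a check of this kind, a page that is reachable under ever longer URLs but
always has the same content adds its links only once.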


> analyze tool takes up all the disk space when there are circular links
> ----------------------------------------------------------------------
>
>          Key: NUTCH-7
>          URL: http://issues.apache.org/jira/browse/NUTCH-7
>      Project: Nutch
>         Type: Bug
>   Components: indexer
>  Environment: analyze runs for an excessive amount of time and creates huge temp files until it runs out of disk space (if you let the db grow)
>     Reporter: Phoebe Miller

>
> It is repeatable by running an instance with these seeds:
> http://www.acf.hhs.gov/programs/ofs/forms.htm/grants/grants/grants/grants/data/grants/data/data/data/data/grants/data/grants/grants/grants/process.htm
> http://www.acf.hhs.gov/programs/ofs/
> and limit it (for best effect) to just:
> *.acf.hhs.gov/*
> Let it go for about 12 cycles to build it up and the temp file size roughly doubles with each segment.
> ]$ ls -l /db/tmpdir2344la/
> ...
> 1503641425 Mar 10 17:42 scoreEdits.0.unsorted
> for a very small db:
> Stats for net.nutch.db.WebDBReader@89cf1e
> -------------------------------
> Number of pages: 6916
> Number of links: 8085
> scoreEdits.0.sorted.0 contains rows of links that looked like the first seed url, but with more grants/ and data/ in the sub dirs.
> In the File:
> .DistributedAnalysisTool.java
>  345                     if (curIndex - startIndex > extent) {
>  346                         break;
>  347                     }
> is the hard stop.
> Further down the score is written:
> 381  for (int i = 0; i < outLinks.length; i++) {
> ...
> 385     scoreWriter.append(outLinks[i].getURL(), score);
> Putting a check here stops the tmpdir.../scoreEdits.0 file growth
> but the links themselves should not be produced in the generation either.
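
One way such ever-longer URLs can arise is sketched below. It is an
illustration under stated assumptions, not a diagnosis of the site above: it
assumes the server returns the same page for any path suffix and that the page
contains a relative link. The class RelativeLinkGrowth, the method follow and
the example.org URL are hypothetical; only java.net.URI.resolve is a real
library call.

```java
import java.net.URI;
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical illustration; not part of Nutch.
public class RelativeLinkGrowth {
    // Follows the same relative link `hops` times, starting from `start`,
    // and returns every distinct URL visited, in order.
    public static Set<String> follow(String start, String relativeLink, int hops) {
        Set<String> visited = new LinkedHashSet<>();
        URI current = URI.create(start);
        visited.add(current.toString());
        for (int i = 0; i < hops; i++) {
            // Resolving against the current URL replaces only its last path
            // segment, so each hop adds one more directory level.
            current = current.resolve(relativeLink);
            visited.add(current.toString());
        }
        return visited;
    }

    public static void main(String[] args) {
        Set<String> urls = follow(
                "http://example.org/forms.htm/grants/process.htm",
                "grants/process.htm",
                3);
        for (String url : urls) {
            System.out.println(url);
        }
    }
}
```

Every hop yields a URL that has not been seen before, so a crawler that
deduplicates by URL alone keeps adding new links for what is the same content.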
