Posted to user@nutch.apache.org by Danicela nutch <Da...@mail.com> on 2012/02/20 17:17:16 UTC

Re : Re : Re: Too few parsed pages

Hi,

I used WebScarab to inspect a fetch on a segment that had few parsed pages.

In fact, 95% of all HTTP requests get a 404 response because the pages no longer exist.

Such pages have the status 'db_gone' in the crawldb, but they are still generated and fetched.

I noticed that Nutch 1.4 has an option to remove db_gone URLs from the crawldb, which would solve my problem.

However, the script I currently use to monitor Nutch processes is not compatible with Nutch 1.4, and adapting it would take some development. Is there anything I can do to remove these db_gone entries in Nutch 1.2?

Thanks.

----- Original message -----
From: Danicela nutch
Sent: 17.02.12 17:32
To: user@nutch.apache.org, markus.jelsma@openindex.io
Subject: Re : Re: Too few parsed pages

Hi,

I now have 242 parsed pages for 18662 fetched pages.

The performance of my crawl has been significantly reduced by this poor efficiency. Is there anything I can do to prevent this? Shouldn't the generate step choose newer URLs instead of already fetched ones?

If I understand correctly, the pages that aren't parsed are pages that did not change since the last fetch. Does that mean the HTML content of each page is sent to the segment with the fetchlist during the generate step? I mean, if the parser compares the current content with the older content, it should have the old content in the segment, as it doesn't read the crawldb.

If this is true, does it also mean that the crawldb contains the full HTML content of all pages (as the generate step gives it to the segments)?

Thanks for helping.

----- Original message -----
From: Markus Jelsma
Sent: 06.02.12 17:06
To: user@nutch.apache.org
Subject: Re: Too few parsed pages

Nothing, this is good. If a page is not modified you don't need to parse it again as it was already parsed in an older segment.

On Monday 06 February 2012 17:03:52 Danicela nutch wrote:
> I don't understand, what should I do?
>
> ----- Original message -----
> From: Markus Jelsma
> Sent: 06.02.12 16:45
> To: user@nutch.apache.org
> Subject: Re: Too few parsed pages
>
> Likely db_not_modified records, they are not parsed.
>
> On Monday 06 February 2012 16:44:25 Danicela nutch wrote:
> > Hi,
> >
> > When I run readseg -list on a segment, I have 60,000 'FETCHED' pages,
> > but only 10,000 'PARSED' pages. One month ago, I had something like
> > 40,000 'PARSED' pages in my segments, and this number has decreased a
> > little every day.
> >
> > If I look in the logs of the segments, I can find approximately these
> > numbers if I count the number of processed pages. But I find nothing
> > strange in the parse that could explain the fact I have so few pages
> > in the end.
> >
> > What can explain the fact I have so few pages which are parsed?
> >
> > Thanks.
>
> --
> Markus Jelsma - CTO - Openindex

--
Markus Jelsma - CTO - Openindex