Posted to user@nutch.apache.org by Cam Bazz <ca...@gmail.com> on 2011/07/07 17:18:42 UTC

inject will not take all the urls

Hello,

I am trying to inject a set of URLs, roughly 800K of them. However, it
seems that only half of them are injected into the crawldb. (I am
checking with the -stats option.)

I wonder why?

Best Regards,
-C.B.

Re: inject will not take all the urls

Posted by Markus Jelsma <ma...@openindex.io>.
Check your URL filters. This is the most common pitfall with injection. Most
likely a fair number of your URLs are being removed by the filters.
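
To get a rough idea of how many seed URLs a filter file would drop, something
like the sketch below may help. It is not Nutch code: it only approximates
regex-urlfilter.txt semantics as I understand them (each rule is a '+' or '-'
followed by a regex, the first matching rule decides, and a URL that matches no
rule is dropped). The sample rules and URLs are made up for illustration, and
Python's `re` is close to but not identical to Java regex, so treat the counts
as an estimate.

```python
import re


def load_rules(lines):
    """Parse regex-urlfilter-style lines into (accept, compiled_pattern) pairs."""
    rules = []
    for line in lines:
        line = line.strip()
        # Blank lines and '#' comments carry no rule.
        if not line or line.startswith("#"):
            continue
        sign, pattern = line[0], line[1:]
        if sign not in "+-":
            continue  # not a rule line; skipped in this sketch
        rules.append((sign == "+", re.compile(pattern)))
    return rules


def accepts(rules, url):
    """Return True if the first matching rule is a '+' rule.

    Assumption: a URL that matches no rule at all is rejected.
    """
    for accept, pattern in rules:
        if pattern.search(url):
            return accept
    return False


# Hypothetical rules, written in the style of a regex-urlfilter.txt file.
SAMPLE_RULES = [
    "# skip URLs containing characters typical of queries",
    "-[?*!@=]",
    "# skip a few image suffixes",
    r"-\.(gif|jpg|png)$",
    "# accept anything else",
    "+.",
]

# Hypothetical seed URLs.
SAMPLE_URLS = [
    "http://example.com/",
    "http://example.com/page?id=1",
    "http://example.com/logo.png",
    "http://example.org/about",
]

rules = load_rules(SAMPLE_RULES)
rejected = [u for u in SAMPLE_URLS if not accepts(rules, u)]

if __name__ == "__main__":
    print(f"{len(rejected)} of {len(SAMPLE_URLS)} URLs would be dropped:")
    for u in rejected:
        print("  " + u)
```

To try it on real data, replace SAMPLE_RULES with the lines of your own filter
file and SAMPLE_URLS with the lines of your seed list, then look at which rule
accounts for most of the rejected URLs.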

On Thursday 07 July 2011 17:18:42 Cam Bazz wrote:
> Hello,
> 
> I am trying to inject a set of urls, in range of 800K. however it
> seems that only half of them are injected to crawldb? (I am checking
> with -stats option)
> 
> I wonder why?
> 
> Best Regards,
> -C.B.

-- 
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536620 / 06-50258350