Posted to dev@nutch.apache.org by Gaurang Patel <ga...@gmail.com> on 2009/10/06 10:36:07 UTC
Authenticity of URLs from DMOZ
Hey,
Can anyone tell me what might cause the following, which happened while
fetching data using bin/nutch fetch?
My AVG Antivirus is detecting virus threats while Nutch fetches pages from
the URLs in *crawldb*. I injected the DMOZ Open Directory URLs into crawldb.
The antivirus detected 4 threats within only half an hour of starting the
fetch.
Is there any other source (other than DMOZ) for a list of URLs covering the
whole web? Or is there an automatic way to keep such harmful URLs from being
fetched? Let me know asap.
Regards,
Gaurang
Re: Authenticity of URLs from DMOZ
Posted by David Jashi <da...@jashi.ge>.
Gaurang,
About those AVG alerts: you are fetching web pages together with any
viruses they may be infected with.
Of course antivirus software will scream about it.
I wouldn't run any such software on a crawling machine.
Respectfully,
David Jashi