Posted to dev@nutch.apache.org by Gaurang Patel <ga...@gmail.com> on 2009/10/06 10:36:07 UTC

Authenticity of URLs from DMOZ

Hey,

Can anyone tell me what could cause the following, which happened while
fetching data with bin/nutch fetch?

My AVG Antivirus is detecting virus threats while Nutch fetches pages from
the URLs in *crawldb*. I injected the DMOZ Open Directory URLs into crawldb.
The antivirus had already detected 4 threats within only half an hour of
starting the fetch.

Is there any other source (other than DMOZ) for a list of URLs covering the
whole web? Or is there an automatic way to keep such harmful URLs from being
fetched? Let me know asap.


Regards,
Gaurang

Re: Authenticity of URLs from DMOZ

Posted by David Jashi <da...@jashi.ge>.
Gaurang,

About those AVG alerts: you are fetching web pages together with any
viruses they may be infected with.
Of course antivirus software will scream about it.

I wouldn't run that kind of software on a crawling machine.

Respectfully,
David Jashi



