Posted to user@nutch.apache.org by "NG-Marketing, M.Schneider" <sc...@ng-marketing.com> on 2006/07/12 12:10:00 UTC

Pornfilter

Hello List,

Does anyone have a "porn filter" that keeps such URLs from being fetched and
so saves bandwidth and storage space?

I could do that with regular expressions and the URL filter, of course. But
maybe there is another way, and somebody has already written a plugin for
that. Any hints would be great.

Yours

Matthias


Re: Pornfilter

Posted by Ken Krugler <kk...@transpac.com>.
>Does anyone have a "porn filter" that keeps such URLs from being fetched and
>so saves bandwidth and storage space?
>
>I could do that with regular expressions and the URL filter, of course. But
>maybe there is another way, and somebody has already written a plugin for
>that. Any hints would be great.

We have an "adult content" filter that we'll be contributing back to 
Nutch. It uses keywords from the URL, content and meta fields to 
generate a probability value. We then flag pages as probable or 
possible adult, based on ranges. Seems to be working pretty well for 
us, though now we need to replicate & re-tune for poker and drug spam.
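
The scoring idea described above could look roughly like the following. This
is a minimal sketch and not the filter Ken mentions: the class name, the
keyword weights, the noisy-OR combination and the two cut-off values are all
assumptions made up for illustration.

```java
import java.util.Locale;
import java.util.Map;

/** Hypothetical sketch of keyword-based scoring; every name and number here is assumed. */
final class AdultScoreSketch {

    enum Flag { CLEAN, POSSIBLE, PROBABLE }

    // Assumed weights: the chance that a single hit of the keyword indicates adult content.
    static final Map<String, Double> KEYWORDS = Map.of(
            "xxx", 0.6,
            "porn", 0.7,
            "adult", 0.3);

    // Assumed lower bounds of the "possible" and "probable" ranges.
    static final double POSSIBLE_MIN = 0.5;
    static final double PROBABLE_MIN = 0.8;

    /** Combines keyword hits from the URL, meta and content fields with a noisy-OR. */
    static double probability(String url, String meta, String content) {
        double notAdult = 1.0;
        for (String field : new String[] {url, meta, content}) {
            String text = field.toLowerCase(Locale.ROOT);
            for (Map.Entry<String, Double> e : KEYWORDS.entrySet()) {
                if (text.contains(e.getKey())) {
                    notAdult *= 1.0 - e.getValue();
                }
            }
        }
        return 1.0 - notAdult;
    }

    /** Maps a probability onto one of the flags by range. */
    static Flag classify(double p) {
        if (p >= PROBABLE_MIN) return Flag.PROBABLE;
        if (p >= POSSIBLE_MIN) return Flag.POSSIBLE;
        return Flag.CLEAN;
    }

    public static void main(String[] args) {
        double p = probability("http://example.com/xxx", "porn", "");
        System.out.println(p + " -> " + classify(p));
    }
}
```

Plain substring matching like this would also hit harmless phrases such as
"adult education", which is one reason such weights and ranges need tuning
against real crawl data.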

Note that this does mean that an adult page does get fetched, but 
where it's a win is in penalizing (via OPIC-style scoring) pages that 
this adult page points to. So we still wind up fetching a lot fewer 
worthless pages.
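
To illustrate the penalty described above: in OPIC-style scoring a page
splits its score evenly among its outlinks, so a flagged page can simply pass
on a reduced share. The sketch below is an assumption about how that might be
done, not Ken's code and not Nutch's scoring API; the class name, method name
and penalty factor are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of penalizing outlinks of a flagged page in OPIC-style scoring. */
final class OutlinkPenaltySketch {

    // Assumed multiplier applied to the score a flagged page passes on.
    static final double ADULT_PENALTY = 0.1;

    /**
     * Splits pageScore evenly among the outlinks, as OPIC does, and scales
     * each share down when the source page was flagged as adult.
     */
    static Map<String, Double> distribute(double pageScore, boolean flaggedAdult,
                                          List<String> outlinks) {
        Map<String, Double> contributions = new LinkedHashMap<>();
        if (outlinks.isEmpty()) {
            return contributions;
        }
        double share = pageScore / outlinks.size();
        if (flaggedAdult) {
            share *= ADULT_PENALTY;
        }
        for (String target : outlinks) {
            // A page may link to the same target more than once; add the shares up.
            contributions.merge(target, share, Double::sum);
        }
        return contributions;
    }

    public static void main(String[] args) {
        List<String> links = List.of("http://a.example/", "http://b.example/");
        System.out.println(distribute(1.0, false, links));
        System.out.println(distribute(1.0, true, links));
    }
}
```

Pages linked mostly from flagged pages then accumulate little score and sink
in the fetch list, which is how fewer of them end up being fetched.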

One potential problem this creates is that a lot of adult sites 
contain links to download various video player software. So some 
high-level pages at Adobe, Microsoft, Apple, etc. wind up getting 
identified as also being "adult" in nature, but since those pages 
aren't part of our focused crawl anyway, it's not a big deal for us.

-- Ken
-- 
Ken Krugler
Krugle, Inc.
+1 530-210-6378
"Find Code, Find Answers"

Re: Pornfilter

Posted by Matthias Jaekle <ja...@eventax.de>.
Hi,

We once downloaded a very flexible regex from squidGuard to ignore most 
of the porn URLs.
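
A regex-based URL check of this kind might look like the sketch below. The
pattern is a short hypothetical stand-in, not the squidGuard expression, and
the class is not a Nutch plugin; it only mirrors the URL filter convention of
returning the URL to accept it and null to reject it.

```java
import java.util.Locale;
import java.util.regex.Pattern;

/** Hypothetical sketch of a keyword regex for URLs; the pattern is an assumed example. */
final class UrlKeywordFilterSketch {

    // Matches a keyword only when it is not embedded in a longer word,
    // so that "essex" does not trigger on "sex".
    static final Pattern BLOCKED =
            Pattern.compile("(^|[^a-z])(porn|xxx|sex)([^a-z]|$)");

    /** Returns the URL unchanged if it passes, or null if it should not be fetched. */
    static String filter(String url) {
        String lower = url.toLowerCase(Locale.ROOT);
        return BLOCKED.matcher(lower).find() ? null : url;
    }

    public static void main(String[] args) {
        System.out.println(filter("http://example.com/xxx/index.html"));
        System.out.println(filter("http://www.essex.example/"));
    }
}
```

In Nutch itself, an expression like this would normally go into the regex
URL filter's rule file as a line prefixed with `-` rather than into new code.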

Matthias

