Posted to user@nutch.apache.org by "NG-Marketing, M.Schneider" <sc...@ng-marketing.com> on 2006/07/12 12:10:00 UTC
Pornfilter
Hello List,
Does anyone have a "porn filter" that keeps those URLs from being fetched
and so saves bandwidth and storage space?
I could do that with regular expressions and the URL filter, of course. But
maybe there is another way, and somebody has already made a plugin for that.
Any hints would be great.
Yours
Matthias
Re: Pornfilter
Posted by Ken Krugler <kk...@transpac.com>.
>does anyone of you have a "pornfilter" not to fetch those URLs and therefore
>save bandwidth and storage space?
>
>I could do that with regular expressions and the URL-filter, of course. But
>maybe there is another way and somebody already made a plugin for that. Any
>hints would be great.
We have an "adult content" filter that we'll be contributing back to
Nutch. It uses keywords from the URL, content and meta fields to
generate a probability value. We then flag pages as probable or
possible adult, based on ranges. Seems to be working pretty well for
us, though now we need to replicate & re-tune for poker and drug spam.
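As a rough illustration of the approach described above (keywords from the URL, content and meta fields combined into a probability-like value, then bucketed by range), here is a minimal Python sketch. It is not the filter being contributed to Nutch: the keyword list, the per-field weights, the thresholds and the function names are all assumptions made up for this example.

```python
import re

# Hypothetical keyword list and per-field weights; the real filter's
# values are not given in this thread.
KEYWORDS = {"porn", "xxx", "adult"}
FIELD_WEIGHTS = {"url": 0.5, "meta": 0.3, "content": 0.2}

# Assumed range boundaries for the two flags.
PROBABLE_THRESHOLD = 0.6
POSSIBLE_THRESHOLD = 0.3


def tokens(text):
    """Split text into lowercase alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def adult_score(url, meta, content):
    """Sum the weights of every field that contains at least one keyword.

    The result lies between 0.0 and 1.0 because the weights sum to 1.0.
    """
    fields = {"url": url, "meta": meta, "content": content}
    return sum(
        weight
        for name, weight in FIELD_WEIGHTS.items()
        if KEYWORDS & set(tokens(fields[name]))
    )


def classify(score):
    """Map a score onto the 'probable' / 'possible' ranges."""
    if score >= PROBABLE_THRESHOLD:
        return "probable"
    if score >= POSSIBLE_THRESHOLD:
        return "possible"
    return "clean"
```

Matching whole tokens rather than substrings keeps a URL such as "essex-news" from triggering on an embedded keyword. A tuned filter would presumably weight individual keywords as well, which this sketch leaves out.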
Note that this does mean that an adult page does get fetched, but
where it's a win is in penalizing (via OPIC-style scoring) pages that
this adult page points to. So we still wind up fetching a lot fewer
worthless pages.
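The penalizing step can be pictured roughly as follows. In OPIC-style scoring a fetched page splits its score evenly among its outlinks. The sketch below assumes that a page flagged as adult passes on only a fraction of its share. The penalty factor, the data layout and the function names are hypothetical, and this is not Nutch's scoring code.

```python
from collections import defaultdict

# Hypothetical factor applied to the score passed on by flagged pages.
ADULT_PENALTY = 0.1


def distribute(cash, outlinks, flagged):
    """Return the score one fetched page contributes to each outlink."""
    if not outlinks:
        return {}
    share = cash / len(outlinks)  # even split among outlinks
    if flagged:
        share *= ADULT_PENALTY  # flagged pages pass on much less
    return {link: share for link in outlinks}


def update_scores(pages):
    """Accumulate contributions from (cash, outlinks, flagged) tuples."""
    scores = defaultdict(float)
    for cash, outlinks, flagged in pages:
        for link, share in distribute(cash, outlinks, flagged).items():
            scores[link] += share
    return dict(scores)
```

Pages linked only from flagged pages end up with low scores, so they sort towards the bottom of the fetch list and are less likely to be fetched.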
One potential problem this creates is that a lot of adult sites
contain links to download various video player software. So some
high-level pages at Adobe, Microsoft, Apple, etc. wind up getting
identified as also being "adult" in nature, but since those pages
aren't part of our focused crawl anyway, it's not a big deal for us.
-- Ken
--
Ken Krugler
Krugle, Inc.
+1 530-210-6378
"Find Code, Find Answers"
Re: Pornfilter
Posted by Matthias Jaekle <ja...@eventax.de>.
Hi,
We once downloaded a very flexible regex from squidGuard to ignore most
of the porn URLs.
Matthias
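For illustration, a URL-level filter of this kind can be sketched in a few lines of Python. The pattern below is a made-up example, not the squidGuard expression mentioned above, and `should_fetch` is a hypothetical helper name.

```python
import re

# Hypothetical blocklist pattern: a keyword counts only when it is not
# surrounded by other letters, so "adulthood" is not matched by "adult".
BLOCKED = re.compile(r"(^|[^a-z])(porn|xxx|adult)([^a-z]|$)", re.IGNORECASE)


def should_fetch(url):
    """Return False when the URL matches the blocklist pattern."""
    return BLOCKED.search(url) is None
```

A filter like this rejects URLs before they are fetched, which is what saves bandwidth, but it only sees the URL and so misses pages whose address gives nothing away.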