Posted to user@nutch.apache.org by Roger Marin <rs...@gmail.com> on 2010/08/11 02:36:21 UTC

Dynamically set urlfilter.regex.file possible?

Hello,

I'm using the crawl API from my application, and I'm trying to point the
urlfilter.regex.file parameter at a crawl-urlfilter.txt file that is
generated dynamically for each crawl. It doesn't seem to be as
straightforward as replacing that parameter with the file path: no matter
what I try, the file never seems to load correctly.

At first I get log entries saying that
org.apache.hadoop.conf.Configuration and
org.apache.nutch.urlfilter.api.RegexURLFilterBase cannot find the resource.
After that the file does seem to get parsed, but a
java.net.MalformedURLException: no protocol: is thrown for each line in it.
I'm not sure what I'm doing wrong. Is this even possible?

87432 [Thread-31] WARN org.apache.nutch.crawl.Injector - Skipping # skip file:, ftp:, & mailto: urls:java.net.MalformedURLException: no protocol: # skip file:, ftp:, & mailto: urls
87433 [Thread-31] WARN org.apache.nutch.crawl.Injector - Skipping -^(file|ftp|mailto)::java.net.MalformedURLException: no protocol: -^(file|ftp|mailto):
87433 [Thread-31] WARN org.apache.nutch.net.urlnormalizer.regex.RegexURLNormalizer - can't find rules for scope 'inject', using default
87433 [Thread-31] WARN org.apache.nutch.crawl.Injector - Skipping :java.lang.NullPointerException
87434 [Thread-31] WARN org.apache.nutch.crawl.Injector - Skipping # skip image and other suffixes we can't yet parse:java.net.MalformedURLException: no protocol: # skip image and other suffixes we can't yet parse
87434 [Thread-31] WARN org.apache.nutch.crawl.Injector - Skipping -\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$:java.net.MalformedURLException: no protocol: -\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$
87434 [Thread-31] WARN org.apache.nutch.crawl.Injector - Skipping :java.lang.NullPointerException
87434 [Thread-31] WARN org.apache.nutch.crawl.Injector - Skipping # skip URLs containing certain characters as probable queries, etc.:java.net.MalformedURLException: no protocol: # skip URLs containing certain characters as probable queries, etc.
87435 [Thread-31] WARN org.apache.nutch.crawl.Injector - Skipping -[]:java.net.MalformedURLException: no protocol: -[]
87435 [Thread-31] WARN org.apache.nutch.crawl.Injector - Skipping :java.lang.NullPointerException
87435 [Thread-31] WARN org.apache.nutch.crawl.Injector - Skipping # skip URLs with slash-delimited segment that repeats 3+ times, to break loops:java.net.MalformedURLException: no protocol: # skip URLs with slash-delimited segment that repeats 3+ times, to break loops
87436 [Thread-31] WARN org.apache.nutch.crawl.Injector - Skipping -.*(/[^/]+)/[^/]+\1/[^/]+\1/:java.net.MalformedURLException: no protocol: -.*(/[^/]+)/[^/]+\1/[^/]+\1/
87436 [Thread-31] WARN org.apache.nutch.crawl.Injector - Skipping :java.lang.NullPointerException
87437 [Thread-31] WARN org.apache.nutch.crawl.Injector - Skipping # accept hosts in MY.DOMAIN.NAME:java.net.MalformedURLException: no protocol: # accept hosts in MY.DOMAIN.NAME
87437 [Thread-31] WARN org.apache.nutch.crawl.Injector - Skipping +^ http://localhost/:java.net.MalformedURLException: no protocol: +^ http://localhost/
87437 [Thread-31] WARN org.apache.nutch.crawl.Injector - Skipping :java.lang.NullPointerException
87438 [Thread-31] WARN org.apache.nutch.crawl.Injector - Skipping # skip everything else:java.net.MalformedURLException: no protocol: # skip everything else
87438 [Thread-31] WARN org.apache.nutch.crawl.Injector - Skipping -.:java.net.MalformedURLException: no protocol: -.
87438 [Thread-31] WARN org.apache.nutch.crawl.Injector - Skipping :java.lang.NullPointerException
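
For what it's worth, the "no protocol" messages can be reproduced outside
Nutch by handing lines of the filter file to java.net.URL, which would be
consistent with the lines being treated as URLs rather than as filter
rules. The sketch below uses only the JDK; the class name FilterLineAsUrl
and the helper parseError are made up for illustration, and it is an
assumption (not something the log confirms) that the Injector parses lines
in a comparable way.

```java
import java.net.MalformedURLException;
import java.net.URL;

public class FilterLineAsUrl {

    // Hypothetical helper: returns the parser's error message,
    // or null if the line is accepted as a URL.
    static String parseError(String line) {
        try {
            new URL(line);
            return null;
        } catch (MalformedURLException e) {
            return e.getMessage();
        }
    }

    public static void main(String[] args) {
        // Lines taken from the filter file shown in the log, plus one real URL.
        String[] lines = {
            "# skip file:, ftp:, & mailto: urls",
            "-^(file|ftp|mailto):",
            "-.",
            "http://localhost/"
        };
        for (String line : lines) {
            System.out.println(line + " -> " + parseError(line));
        }
    }
}
```

The filter lines are rejected because none of them begins with a valid
scheme such as http:, while the plain URL is accepted.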

Thanks.