Posted to user@nutch.apache.org by Sol Lederman <so...@gmail.com> on 2017/11/21 14:45:04 UTC

Can't get any regex to work in regex-urlfilters.txt

In my regex-urlfilters.txt I have the default filters that come with Nutch.
If I have +. as the very last line of the file, crawling works fine.

If I change that line to anything else, I get "Total urls rejected by
filters: 1" and no URLs are fetched.

I've tried a bunch of different entries in the last line:

+html
+*html
+*html$
+.*(html)$

What am I missing?

Thanks.

Sol

Re: Can't get any regex to work in regex-urlfilters.txt

Posted by Sebastian Nagel <wa...@googlemail.com>.
Hi Sol,

> doesn't "+html" work as well regardless of what is in seeds.txt? I should
> be able to have http://foo.bar in seeds.txt and "+html" for the regex
> filter, yes?

URL filters are also applied to the seed list by default. That's
why the Injector logs
 Total urls rejected by filters: 1


> All I get back is "-http://foo.bar"

That means that this URL is rejected. Accepted URLs are marked by a leading "+".

> What am I missing?

You may
 - disable URL filters for the injector (-noFilter)
 - or make sure that all seeds are accepted by the configured URL filters,
   e.g. by adding a rule:
    +http://foo\.bar/?$
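
As a rough illustration of why the seed is rejected, here is a minimal Python sketch of first-match-wins filtering. It is not Nutch's code: Nutch uses Java regular expressions, and the function names (load_rules, accepts) and the shortened rule lists are assumptions made for this example. The sketch assumes that a rule matches when its pattern is found anywhere in the URL, and that a URL matching no rule is rejected.

```python
import re

def load_rules(lines):
    """Parse '+regex' / '-regex' lines; skip blank lines and '#' comments."""
    rules = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # The first character is the sign, the rest is the pattern.
        rules.append((line[0] == "+", re.compile(line[1:])))
    return rules

def accepts(rules, url):
    """The first rule whose pattern is found in the URL decides; no match rejects."""
    for accept, pattern in rules:
        if pattern.search(url):
            return accept
    return False

seed = "http://foo.bar"  # seed URL from the thread

# Hypothetical, shortened rule files for illustration only.
catch_all = load_rules(["-^(file|ftp|mailto):", "+."])
html_only = load_rules(["-^(file|ftp|mailto):", "+html"])
with_seed = load_rules(["-^(file|ftp|mailto):", r"+^http://foo\.bar/?$", r"+\.html$"])

print(accepts(catch_all, seed))                         # True
print(accepts(html_only, seed))                         # False
print(accepts(html_only, "http://foo.bar/index.html"))  # True
print(accepts(with_seed, seed))                         # True
```

With "+." every URL contains at least one character, so the seed passes. With "+html" the seed http://foo.bar contains no "html", so no rule matches and it is dropped before anything is fetched.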

Best,
Sebastian

On 11/21/2017 09:09 PM, Sol Lederman wrote:
> Sebastian,
> 
> Thanks for the engagement and for the quick reply. I still can't get it to
> work. Here's something I don't understand. I assume that the dot in "+."
> means to match any character so it matches any URL. That's great. Why
> doesn't "+html" work as well regardless of what is in seeds.txt? I should
> be able to have http://foo.bar in seeds.txt and "+html" for the regex
> filter, yes? Or, are you saying that my regex filter has to look something
> like "http://foo.bar/.*html"?
> 
> In any case, I've tried a variety of regex patterns, with and without the
> domain name in them, and none of them work. And, yes, the site in question
> does have files at the top level ending in ".html". And, yes, the default
> nutch.apache.org case crawls fine.
> 
> I also did do the filterchecker test. All I get back is "-http://foo.bar"
> and a return code of 0. I get the same behavior for the working
> nutch.apache.org seed URL.
> 
> What am I missing?
> 
> Thanks again.
> 
> Sol
> 


Re: Can't get any regex to work in regex-urlfilters.txt

Posted by Sol Lederman <so...@gmail.com>.
Sebastian,

Thanks for the engagement and for the quick reply. I still can't get it to
work. Here's something I don't understand. I assume that the dot in "+."
means to match any character so it matches any URL. That's great. Why
doesn't "+html" work as well regardless of what is in seeds.txt? I should
be able to have http://foo.bar in seeds.txt and "+html" for the regex
filter, yes? Or, are you saying that my regex filter has to look something
like "http://foo.bar/.*html"?

In any case, I've tried a variety of regex patterns, with and without the
domain name in them, and none of them work. And, yes, the site in question
does have files at the top level ending in ".html". And, yes, the default
nutch.apache.org case crawls fine.

I also did do the filterchecker test. All I get back is "-http://foo.bar"
and a return code of 0. I get the same behavior for the working
nutch.apache.org seed URL.

What am I missing?

Thanks again.

Sol

Re: Can't get any regex to work in regex-urlfilters.txt

Posted by Sebastian Nagel <wa...@googlemail.com>.
Hi,

These are invalid expressions (a leading * has nothing to repeat):
> +*html
> +*html$

These should work:
> +html
> +.*(html)$

but a simpler expression than the last one would be
+\.html$

Of course, if your seed URL does not match the regular expression, it is excluded.
That's the case for, e.g.:
 http://nutch.apache.org/
 http://example.com/index.php
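
The behaviour of the individual expressions can be tried out in isolation. The sketch below uses Python's re module as a stand-in for the Java regular expressions that Nutch uses; for these particular patterns the two are expected to agree, but that is an assumption of the example. The two example.com URLs containing "html" are hypothetical and added only for illustration.

```python
import re

# Expressions from the thread, without the leading '+'.
candidates = ["html", "*html", "*html$", ".*(html)$", r"\.html$"]

urls = [
    "http://nutch.apache.org/",
    "http://example.com/index.php",
    "http://example.com/index.html",      # hypothetical
    "http://example.com/html/index.php",  # hypothetical
]

def check(expr, url):
    """Return 'invalid' if expr does not compile, else whether it is found in url."""
    try:
        return re.search(expr, url) is not None
    except re.error:
        return "invalid"

results = {expr: [check(expr, u) for u in urls] for expr in candidates}
for expr, row in results.items():
    print(f"{expr!r:14} {row}")
```

The last URL shows the difference between the working expressions: the plain "html" is found in the directory name, while the expressions anchored with $ only accept URLs that end in "html".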

It's better to verify whether the URL filter configuration works as expected
beforehand:
 cat .../seeds.txt | $NUTCH_HOME/bin/nutch filterchecker -allCombined
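
The checker output can then be split into accepted and rejected URLs. This is a small sketch that assumes only what is described in this thread, namely that each output line is a URL prefixed with "+" (accepted) or "-" (rejected). The function name and the sample lines are hypothetical.

```python
def split_checker_output(lines):
    """Sort '+url' / '-url' lines into accepted and rejected lists."""
    accepted, rejected = [], []
    for line in lines:
        line = line.strip()
        if line.startswith("+"):
            accepted.append(line[1:])
        elif line.startswith("-"):
            rejected.append(line[1:])
        # Anything else (log messages, blank lines) is ignored.
    return accepted, rejected

# Hypothetical sample output for illustration.
sample = ["-http://foo.bar", "+http://foo.bar/index.html"]
accepted, rejected = split_checker_output(sample)
print("accepted:", accepted)
print("rejected:", rejected)
```

Any seed that ends up in the rejected list will also be counted under "Total urls rejected by filters" when it is injected.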


If you want to keep only HTML pages, have a look at the plugins
  urlfilter-suffix
    to filter away URLs with undesired file extensions (.pdf, .xlsx, etc.)
  mimetype-filter
    to index selectively by MIME type


Best,
Sebastian

