Posted to user@nutch.apache.org by Stephen R Guglielmo <sr...@gmail.com> on 2017/04/04 19:46:03 UTC

Regex URL Filter Question

Hi list,

I'm working on configuring Nutch with ElasticSearch to provide
website search functionality. I've been reading the Nutch
documentation and the NutchTutorial. In the NutchTutorial on the Wiki,
the section "Configure Regular Expression Filters" gives the example
of:

+^http://([a-z0-9]*\.)*nutch.apache.org/

However, I am a bit confused by this. Firstly, do forward slashes (/)
not need to be escaped as usual in a regular expression? As in
^http:\/\/(a-z..... instead of ^http://(a-z....

Also, I notice the first period is escaped, but the two periods in
"nutch.apache.org" are not escaped. Periods are normally wildcards in
regular expressions, hence my confusion.
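
To make the concern concrete, here is a small sketch. It assumes the
filter uses Java's java.util.regex (my guess, since Nutch is written in
Java; I have not confirmed it), and that the leading + is an include
marker rather than part of the pattern. The class name, the helper
method, and the lookalike host are made up for illustration.

```java
import java.util.regex.Pattern;

class UrlFilterCheck {
    // Pattern from the tutorial, minus the leading '+' (assumed to be
    // the include marker, not part of the regex itself).
    static final Pattern TUTORIAL =
        Pattern.compile("^http://([a-z0-9]*\\.)*nutch.apache.org/");

    // Same pattern with the dots in the host name escaped.
    static final Pattern ESCAPED =
        Pattern.compile("^http://([a-z0-9]*\\.)*nutch\\.apache\\.org/");

    // Same prefix with the slashes escaped, as I would write it in a
    // /.../ delimited regex in other tools.
    static final Pattern ESCAPED_SLASHES = Pattern.compile("^http:\\/\\/");

    // Hypothetical helper: true if the pattern is found anywhere in the
    // URL (the ^ anchor pins it to the start).
    static boolean accepts(Pattern p, String url) {
        return p.matcher(url).find();
    }

    public static void main(String[] args) {
        String real = "http://nutch.apache.org/";
        String lookalike = "http://nutchXapache.org/"; // made-up host

        System.out.println("tutorial, real:      " + accepts(TUTORIAL, real));
        System.out.println("tutorial, lookalike: " + accepts(TUTORIAL, lookalike));
        System.out.println("escaped,  real:      " + accepts(ESCAPED, real));
        System.out.println("escaped,  lookalike: " + accepts(ESCAPED, lookalike));
        System.out.println("escaped slashes:     " + accepts(ESCAPED_SLASHES, real));
    }
}
```

If that assumption holds, the unescaped version would also accept the
lookalike host, because each bare period matches any character, while
the slashes behave the same whether or not they are escaped.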

Is this an error in the documentation? Are these regexes PCRE or POSIX?

Thank you!
Steve