
Posted to user@nutch.apache.org by Ken Ken <ke...@yahoo.com> on 2010/01/09 09:30:46 UTC

regex-urlfilter.txt: only crawl .com tld

/nutch-1.0/conf/regex-urlfilter.txt

Hello,

I just want to fetch/crawl all .com domain names. What should I put in the /nutch-1.0/conf/regex-urlfilter.txt file?

e.g.
+^http://([a-z0-9]*\.)*apache.org/

Correct me if I am wrong. I think the above only crawls/fetches apache.org and apache.org's subdomains, but I am not sure whether it will also fetch/crawl sub-subdomains of apache.org.

I wonder if this will fetch/crawl only .com domain names:

+^http://([a-z0-9]*\.)*com/

If so, how do I get it to crawl/fetch subdomains and sub-subdomains (http://subdomain.subdomain.yourname.com) as well?

Thanks

Re: regex-urlfilter.txt: only crawl .com tld

Posted by reinhard schwab <re...@aon.at>.

Ken Ken wrote:
> /nutch-1.0/conf/regex-urlfilter.txt
>
> Hello,
>
> I just want to fetch/crawl all .com domain names, so what should I put in the /nutch-1.0/conf/regex-urlfilter.txt file
>
> e.g.
> +^http://([a-z0-9]*\.)*apache.org/
>
> Correct me if I am wrong. I think the above only crawls/fetches apache.org and apache.org's subdomains, but I am not sure whether it will also fetch/crawl sub-subdomains of apache.org.
>
It will. You can test it with this code:

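import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Put this method in any class and call it from main().
static void test3() {
    // The rule from regex-urlfilter.txt, without the leading '+'
    String pattern = "^http://([a-z0-9]*\\.)*apache.org/";
    String input = "http://a0.a1.a2.apache.org/";
    Pattern p = Pattern.compile(pattern);
    Matcher m = p.matcher(input);
    // Print every match of the pattern found in the input
    while (m.find()) {
        System.out.println("Found: " + m.group());
    }
}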
 
The output is:
Found: http://a0.a1.a2.apache.org/
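
The same kind of check can be applied to the .com pattern from the original question. The following is only a sketch: the class name ComFilterCheck and the sample URLs are made up for illustration, and it uses find() like the test above rather than calling Nutch itself. It suggests that the pattern accepts any depth of subdomains under .com, but rejects host names containing a hyphen (not in [a-z0-9]) and URLs without a trailing slash after "com".

```java
import java.util.regex.Pattern;

class ComFilterCheck {
    // The rule from the question, without the leading '+' used in regex-urlfilter.txt.
    static final Pattern COM_ONLY = Pattern.compile("^http://([a-z0-9]*\\.)*com/");

    // Mirrors the find()-based check used in the test above.
    static boolean accepts(String url) {
        return COM_ONLY.matcher(url).find();
    }

    public static void main(String[] args) {
        // Hypothetical sample URLs, chosen to show accepted and rejected cases.
        String[] urls = {
            "http://yourname.com/",
            "http://subdomain.subdomain.yourname.com/",
            "http://a0.a1.a2.apache.org/",
            "http://my-site.com/",  // hyphen is not in [a-z0-9]
            "http://yourname.com"   // no trailing slash after "com"
        };
        for (String url : urls) {
            // '+' means the pattern matched, '-' means it did not.
            System.out.println((accepts(url) ? "+" : "-") + url);
        }
    }
}
```

If hyphenated host names should be crawled too, one option is to widen the character class to [a-z0-9-].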

> I wonder if this will fetch/crawl only .com domain names.
>
> +^http://([a-z0-9]*\.)*com/
>
> If so, how do I also get it to crawl/fetch subdomain and sub-subdomains (http://subdomain.subdomain.yourname.com) also?
>
> Thanks

Re: regex-urlfilter.txt: only crawl .com tld

Posted by James Todd <ja...@gmail.com>.

Here's how I test regex-urlfilter entries:

$ echo "[url]" | java -cp \
  ./nutch-1.0.jar:./plugins/urlfilter-regex/urlfilter-regex.jar:./plugins/lib-regex-filter/lib-regex-filter.jar:./lib/hadoop-0.19.1-core.jar:./lib/commons-logging-1.0.4.jar:./lib/commons-logging-api-1.0.4.jar:./conf \
  org.apache.nutch.urlfilter.regex.RegexURLFilter

Replace [url] with the URLs you'd like to test.
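
If the Nutch jars are not at hand, a rough stand-in can be written with java.util.regex alone. The sketch below is not Nutch's implementation; the class RuleListSketch and its methods are hypothetical. It assumes the behaviour described in the comments of the default regex-urlfilter.txt: each rule is '+' or '-' followed by a regex, the first matching rule decides, and a URL that matches no rule is ignored.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

class RuleListSketch {
    // One rule: '+' (include) or '-' (exclude) followed by a regular expression.
    static final class Rule {
        final boolean include;
        final Pattern pattern;

        Rule(String line) {
            char sign = line.charAt(0);
            if (sign != '+' && sign != '-') {
                throw new IllegalArgumentException("Rule must start with + or -: " + line);
            }
            this.include = sign == '+';
            this.pattern = Pattern.compile(line.substring(1));
        }
    }

    // Parse rule lines, skipping blank lines and '#' comments.
    static List<Rule> parse(String... lines) {
        List<Rule> rules = new ArrayList<>();
        for (String line : lines) {
            String trimmed = line.trim();
            if (trimmed.isEmpty() || trimmed.startsWith("#")) {
                continue;
            }
            rules.add(new Rule(trimmed));
        }
        return rules;
    }

    // The first matching rule decides; a URL that matches no rule is rejected.
    static boolean accept(List<Rule> rules, String url) {
        for (Rule rule : rules) {
            if (rule.pattern.matcher(url).find()) {
                return rule.include;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Illustrative rule set: an exclusion for query-like characters, then the .com rule.
        List<Rule> rules = parse(
            "# skip URLs that look like queries",
            "-[?*!@=]",
            "+^http://([a-z0-9]*\\.)*com/");
        // Each URL passed on the command line is printed with '+' (accepted) or '-' (rejected).
        for (String url : args) {
            System.out.println((accept(rules, url) ? "+" : "-") + url);
        }
    }
}
```

Because the first match wins, the order of lines matters: an exclusion placed after the .com rule would never apply to a URL that the .com rule already accepted.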

hth,

- james
