
Posted to user@nutch.apache.org by Jason Tsai <ge...@gmail.com> on 2014/01/14 03:02:28 UTC

need help about urlfilter

Hi there
I'm new to Nutch and I have built Nutch 2.2.1 with Hadoop 1.0.4, including HBase.
I understand that Nutch can filter URLs by editing regex-urlfilter.txt, but
every attempt I have made has failed.
My seed.txt in HDFS has only one URL: http://www.cancer.gov/
and I'm trying to get only the pages whose URLs look like
http://www.cancer.gov/cancertopics/druginfo/example
I tried +^http://www.cancer.gov/cancertopics/druginfo.*
or +^http://www.cancer.gov/cancertopics/druginfo/([a-z0-9]*\.)
and so on in regex-urlfilter.txt, but none of them work.
Can anyone give me a hand?
Thanks!
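
A minimal sketch of the first attempt above, written in Python rather than Nutch itself. It assumes the usual reading of regex-urlfilter.txt: rules are tried top to bottom, the first pattern found in the URL decides ('+' keeps it, '-' drops it), and a URL that matches no rule is dropped. The names RULES and url_passes are hypothetical and are not part of Nutch.

```python
import re

# Hypothetical rule list mirroring the first attempt from the post:
# a single '+' rule and nothing else.
RULES = [
    ("+", r"^http://www.cancer.gov/cancertopics/druginfo.*"),
]

def url_passes(url, rules):
    """Return True if the first rule whose pattern is found in url is a '+' rule.

    Assumption: first match wins, and a URL matching no rule is rejected.
    """
    for sign, pattern in rules:
        if re.search(pattern, url):
            return sign == "+"
    return False

if __name__ == "__main__":
    for url in (
        "http://www.cancer.gov/",
        "http://www.cancer.gov/cancertopics/druginfo/lungcancer",
    ):
        print(url, url_passes(url, RULES))
```

Under these assumptions the druginfo page passes but the seed URL http://www.cancer.gov/ does not. If the same filter is also applied to seed URLs when they are injected (an assumption here, not something confirmed in this thread), the crawl would have nothing to start from, which would be one thing to check.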

Re: need help about urlfilter

Posted by Jason Tsai <ge...@gmail.com>.

nobody here?...


