Posted to user@nutch.apache.org by Danilo Fernandes <da...@kelsorfernandes.com.br> on 2013/02/25 16:09:48 UTC

regex-urlfilter file for multiple domains

Hello,


I started by crawling a single site and didn't have any problems. But now I
need to define criteria for each domain.

How can I create a different regex-urlfilter for each of them?

The idea is to catch only some pages of each site, not all of them. Each site
has a different structure, and I need to cover all of them.

For example:

Domain1.com/sale = I want to catch.
Domain1.com/cars = I don't.

Regex: -Domain1.com/[^s].*

Domain2.com/flytickets = I want to catch.
Domain2.com/contatPage = I don't.

Regex: -Domain2.com/[^f].*

Is it possible?

Thanks again.

 

Danilo Fernandes


Re: regex-urlfilter file for multiple domains

Posted by Tejas Patil <te...@gmail.com>.
Hey Danilo,

On Mon, Feb 25, 2013 at 7:09 AM, Danilo Fernandes <
danilo@kelsorfernandes.com.br> wrote:

> Is it possible?
>
Yes.
You can do one of the following:
1. Have only accept rules, with a "-." at the end to omit URLs that don't
match any of them.
2. Have only reject rules, with a "+." at the end to accept URLs that were
not rejected.
3. Use a combination of both.

Say you go by #1. Then for the given example, it would be something like:
------------------------------------------
+Domain1.com/sale.*
+Domain2.com/flytickets.*
-.
------------------------------------------
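
To see how rules like these play out, here is a minimal Python sketch that mimics the evaluation. It is not Nutch code. It assumes that rules are tried top to bottom, that the first rule whose regex is found anywhere in the URL decides the outcome, and that a URL matching no rule is dropped. The names parse_rules and accepts are made up for this illustration.

```python
import re

def parse_rules(text):
    """Parse regex-urlfilter style lines into (accept, compiled_pattern) pairs."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        sign, pattern = line[0], line[1:]
        if sign not in "+-":
            raise ValueError(f"rule must start with + or -: {line!r}")
        rules.append((sign == "+", re.compile(pattern)))
    return rules

def accepts(rules, url):
    """Assumed semantics: the first rule found in the URL wins; no match means reject."""
    for accept, pattern in rules:
        if pattern.search(url):
            return accept
    return False

# The rule set from the example above (option #1: accept rules, then "-.").
RULES = parse_rules("""
+Domain1.com/sale.*
+Domain2.com/flytickets.*
-.
""")

if __name__ == "__main__":
    for url in [
        "http://Domain1.com/sale/item1",
        "http://Domain1.com/cars",
        "http://Domain2.com/flytickets",
        "http://Domain2.com/contatPage",
    ]:
        print(url, "->", "keep" if accepts(RULES, url) else "skip")
```

Two things to watch for under these assumptions: the patterns are case-sensitive, so they should be written in the same case as the URLs actually being crawled (host names are usually lowercase), and an unescaped "." in a host name matches any character, so "Domain1\.com" is the stricter form.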
hth


thanks,
Tejas Patil