Posted to dev@nutch.apache.org by Diaa Abdallah <di...@gmail.com> on 2014/05/16 09:45:14 UTC

Inject auto generated urls

Hi,
In some cases, when you crawl a website, you already know many page URLs
that share a similar structure.

For example, on IMDb, artist pages have the following link structure:
http://www.imdb.com/name/nm1/
http://www.imdb.com/name/nm2/
http://www.imdb.com/name/nm6499112/

How about allowing URLs to be added based on generators?
For example, you would define in the URL file:
http://www.imdb.com/name/nm{{[1-6499112]}}/

where {{ <simple-regex> }} marks the place to put a number/letter generator,

so that all of these URLs are injected into Nutch.
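
To make the idea concrete, here is a minimal sketch of how such a template could be expanded. It is plain Python, not Nutch code, and it only handles the numeric range form {{[start-end]}} shown above; the function name expand_template and the regular expression are assumptions made for illustration. It yields URLs lazily, so a range of several million entries is never held in memory at once.

```python
import re

# Assumed placeholder syntax, taken from the example above: {{[start-end]}}
# with decimal start and end (both inclusive). Letter generators are not handled.
_RANGE = re.compile(r"\{\{\[(\d+)-(\d+)\]\}\}")


def expand_template(template):
    """Lazily yield every URL described by a template (hypothetical helper)."""
    match = _RANGE.search(template)
    if match is None:
        # No placeholder left: the template is already a concrete URL.
        yield template
        return
    start, end = int(match.group(1)), int(match.group(2))
    prefix = template[:match.start()]
    suffix = template[match.end():]
    for number in range(start, end + 1):
        # Recurse so that any further placeholders in the suffix are expanded too.
        yield from expand_template(prefix + str(number) + suffix)


if __name__ == "__main__":
    for url in expand_template("http://www.imdb.com/name/nm{{[1-3]}}/"):
        print(url)
```

The output of such a generator could be written to an ordinary seed file, one URL per line, so the existing inject step would not need to change.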

I could work on that if people are interested.

Regards,
Diaa

Re: Inject auto generated urls

Posted by Frédéric Passaniti <f....@gmail.com>.
This is not exactly the same approach, but I'm currently looking for
a way to inject new URLs at run time.
My idea is to detect interesting new URLs in a custom parser / HTML
plugin and inject them directly into the seed list (without having to
restart Nutch).
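
As a rough illustration of that idea, the sketch below shows a seed list that accepts new URLs while a crawl loop is running. It is a generic, in-memory Python model and does not use any Nutch API; the class name LiveSeedList and its methods are hypothetical.

```python
import queue
import threading


class LiveSeedList:
    """In-memory seed list that can receive URLs at run time (hypothetical)."""

    def __init__(self):
        self._pending = queue.Queue()   # URLs waiting to be fetched
        self._seen = set()              # every URL ever injected, for de-duplication
        self._lock = threading.Lock()   # guards _seen across parser threads

    def inject(self, url):
        """Add a URL found by a parser; return False if it was already known."""
        with self._lock:
            if url in self._seen:
                return False
            self._seen.add(url)
        self._pending.put(url)
        return True

    def next_url(self):
        """Return the next URL to fetch, or None if nothing is pending."""
        try:
            return self._pending.get_nowait()
        except queue.Empty:
            return None


if __name__ == "__main__":
    seeds = LiveSeedList()
    seeds.inject("http://www.imdb.com/name/nm1/")
    seeds.inject("http://www.imdb.com/name/nm1/")  # duplicate, ignored
    print(seeds.next_url())
    print(seeds.next_url())
```

A parser plugin would call inject() for each interesting link it finds, and the fetch loop would poll next_url(), so no restart is needed to pick up the new URLs.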





-- 
Frédéric Passaniti