Posted to user@nutch.apache.org by Zhaidarbek Ayazbayev <zh...@gmail.com> on 2011/06/06 07:47:11 UTC

Custom seed source

Dear nutch developers,

By default, Nutch takes seeds from the file system (-seedDir). Is it possible
to change it to take seeds from a MySQL table?
Is "Nutch URL Normalizer (org.apache.nutch.net.URLNormalizer)" the right
extension point to implement my plugin for this?

Regards,
Zhaidarbek Ayazbayev

Re: Custom seed source

Posted by Fyodor Yarochkin <fy...@armorize.com>.
On Wed, Jun 8, 2011 at 4:11 AM, Markus Jelsma
<ma...@openindex.io> wrote:
>> Dear nutch developers,
>>
>> By default, Nutch takes seeds from the file system (-seedDir). Is it possible
>> to change it to take seeds from a MySQL table?
>
> In theory, yes, but I would not recommend it. It would be quite a job to make
> the mapper play nicely with database queries.

As another option, you can implement a script to sync the -seedDir
folder with your MySQL table.
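
A minimal sketch of such a sync script, in Python. It assumes a hypothetical
table named "seeds" with a "url" column and a seed file named "seeds.txt";
none of these names come from Nutch or from this thread. The function accepts
any DB-API 2.0 connection, so a connection from a MySQL driver should work, but
that part is untested here. It writes to a local directory only.

```python
import os
import tempfile


def export_seeds(conn, seed_dir, query="SELECT url FROM seeds",
                 filename="seeds.txt"):
    """Dump URLs from a database into seed_dir/filename, one per line.

    conn is any DB-API 2.0 connection. The default query, table name and
    file name are hypothetical placeholders; adjust them to your schema.
    Returns the number of URLs written.
    """
    cur = conn.cursor()
    cur.execute(query)
    # Drop NULL/blank rows, trim whitespace, de-duplicate, sort for stable output.
    urls = sorted({row[0].strip() for row in cur.fetchall()
                   if row[0] and row[0].strip()})
    cur.close()

    os.makedirs(seed_dir, exist_ok=True)
    # Write to a temporary file in the same directory and rename it afterwards,
    # so a reader never sees a half-written seed file.
    fd, tmp_path = tempfile.mkstemp(dir=seed_dir, prefix=".seeds-", suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as out:
        for url in urls:
            out.write(url + "\n")
    os.replace(tmp_path, os.path.join(seed_dir, filename))
    return len(urls)
```

Run it (from cron, for example) before each inject step, so the seed directory
reflects the current table contents.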

Re: Custom seed source

Posted by Markus Jelsma <ma...@openindex.io>.
> Dear nutch developers,
> 
> Nutch by default takes seeds from file system (-seedDir). Is it possible to
> change it to take seeds from mysql table?

In theory, yes, but I would not recommend it. It would be quite a job to make
the mapper play nicely with database queries.


> Is "Nutch URL Normalizer (org.apache.nutch.net.URLNormalizer)" the right
> extension point to implement my plugin for this?

No, that extension point does something else. It normalizes URLs to a format
you accept, for example by adding trailing slashes (or not) or by removing
duplicate occurrences of certain characters.
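
To illustrate what "normalizing" means here, a small Python sketch follows. It
is not Nutch's implementation (Nutch normalizers are Java plugins behind
org.apache.nutch.net.URLNormalizer), and the function name normalize_url and
the particular rules chosen are assumptions made for the example.

```python
import re
from urllib.parse import urlsplit, urlunsplit


def normalize_url(url):
    """Illustrative normalizer: collapse repeated slashes in the path,
    use "/" when the path is empty, and drop the #fragment."""
    parts = urlsplit(url.strip())
    # Replace runs of two or more slashes with one; fall back to "/" if empty.
    path = re.sub(r"/{2,}", "/", parts.path) or "/"
    return urlunsplit((parts.scheme, parts.netloc, path, parts.query, ""))


print(normalize_url("http://example.com//a///b#top"))  # http://example.com/a/b
print(normalize_url("http://example.com"))             # http://example.com/
```

A normalizer only rewrites URLs that are already flowing through the crawl, so
it cannot supply the seed list.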

http://svn.apache.org/viewvc/nutch/branches/branch-1.3/src/java/org/apache/nutch/crawl/Injector.java?view=markup

> 
> Regards,
> Zhaidarbek Ayazbayev