Posted to user@nutch.apache.org by Gal Nitzan <gn...@usa.net> on 2005/09/29 21:53:42 UTC

New plugin

Hi,

I have written a small new plugin, urlfilter-db, based on the URLFilter 
interface.

The purpose of this plugin is to filter by domain, i.e. I would like to 
crawl the world but fetch only certain domains.

The plugin uses a caching system (SwarmCache, easier to deploy than JCS) 
with a database on the back end.

For each url
    filter is called
end for

filter
    get the domain name from the url
    look the domain up in the cache
    if it is in the cache, return it
    if it is not in the cache, try the database
    if it is in the database, cache it and return it
    otherwise return null
end filter
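
The lookup in the pseudocode above could be sketched in Java roughly as 
follows. This is only an illustration under stated assumptions, not the 
plugin's actual code: the class name DbUrlFilterSketch is hypothetical, a 
small LinkedHashMap-based LRU cache stands in for SwarmCache, a Predicate 
stands in for the database query, the URL's host is used as the "domain", 
and "return it" is read as returning the URL itself when the domain is 
allowed.

```java
import java.net.URI;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.function.Predicate;

public class DbUrlFilterSketch {

    /** Tiny LRU cache; a stand-in for SwarmCache (assumption). */
    static final class LruCache<K, V> extends LinkedHashMap<K, V> {
        private final int maxEntries;

        LruCache(int maxEntries) {
            super(16, 0.75f, true); // true = access order, needed for LRU eviction
            this.maxEntries = maxEntries;
        }

        @Override
        protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
            return size() > maxEntries;
        }
    }

    private final Map<String, Boolean> cache;
    // Stand-in for the database lookup (assumption): true if the domain is listed.
    private final Predicate<String> domainInDatabase;

    public DbUrlFilterSketch(int cacheSize, Predicate<String> domainInDatabase) {
        this.cache = new LruCache<>(cacheSize);
        this.domainInDatabase = domainInDatabase;
    }

    /** Returns the URL if its domain is allowed, otherwise null. */
    public String filter(String url) {
        String domain = hostOf(url);
        if (domain == null) {
            return null; // unparsable URL or no host part
        }
        if (cache.get(domain) != null) {
            return url; // cache hit: only allowed domains are ever cached
        }
        if (domainInDatabase.test(domain)) {
            cache.put(domain, Boolean.TRUE); // remember the allowed domain
            return url;
        }
        return null; // not in cache and not in database
    }

    /** Extracts the lower-cased host, or null if the URL cannot be parsed. */
    static String hostOf(String url) {
        try {
            String host = URI.create(url).getHost();
            return host == null ? null : host.toLowerCase(Locale.ROOT);
        } catch (IllegalArgumentException e) {
            return null;
        }
    }

    public static void main(String[] args) {
        Set<String> allowed = Set.of("example.org"); // pretend database contents
        DbUrlFilterSketch f = new DbUrlFilterSketch(100, allowed::contains);
        System.out.println(f.filter("http://example.org/page"));
        System.out.println(f.filter("http://blocked.example.com/"));
    }
}
```

Like the pseudocode, this sketch caches only allowed domains, so every URL 
from a domain that is not listed still goes to the database. Caching 
negative results as well would be one way to reduce that load.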


The plugin reads the cache size, JDBC driver, connection string, table 
to use, and domain field from nutch-site.xml.

Since I do not have the tools to add it to svn myself, if someone 
is interested, let me know and I can mail it.

Regards,

Gal

Re: New plugin

Posted by Gal Nitzan <gn...@usa.net>.
John X wrote:
> Hi, Gal,
>
> Yes, I am interested. You can post the tarball to
> http://issues.apache.org/jira/browse/Nutch
>
> Thanks,
>
> John
>

Done. Enjoy: http://issues.apache.org/jira/browse/NUTCH-100

Regards, Gal

Re: New plugin

Posted by John X <jo...@neasys.com>.
Hi, Gal,

Yes, I am interested. You can post the tarball to
http://issues.apache.org/jira/browse/Nutch

Thanks,

John

__________________________________________
http://www.neasys.com - A Good Place to Be
Come to visit us today!