Posted to dev@nutch.apache.org by "Gal Nitzan (JIRA)" <ji...@apache.org> on 2005/09/29 22:38:47 UTC

[jira] Created: (NUTCH-100) New plugin urlfilter-db

New plugin urlfilter-db
-----------------------

         Key: NUTCH-100
         URL: http://issues.apache.org/jira/browse/NUTCH-100
     Project: Nutch
        Type: New Feature
  Components: fetcher  
    Versions: 0.8-dev    
 Environment: MapRed
    Reporter: Gal Nitzan
    Priority: Trivial


Hi,

I have written a new (small) plugin based on the URLFilter interface: urlfilter-db.

The purpose of this plugin is to filter by domain, i.e. I would like to crawl the whole web but fetch only from certain domains.

The plugin uses a caching system (SwarmCache, which is easier to deploy than JCS) with a database as the back end.

for each url
   call filter(url)
end for

filter(url)
  get the domain name from the url
  look the domain up in the cache
  if in the cache, return the url
  if not in the cache, try the database
  if in the database, cache the domain and return the url
  otherwise return null
end filter


The plugin reads the cache size, JDBC driver, connection string, table name and domain field from nutch-site.xml.
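
For readers who prefer code to pseudocode, the lookup described above might be sketched roughly as follows. This is not the plugin's source: the class and method names are made up for illustration, a plain HashSet stands in for SwarmCache, a Predicate stands in for the JDBC query, and the URL's host is used as "the domain". It assumes Java 8 or later.

```java
import java.net.URI;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;
import java.util.function.Predicate;

class DomainLookupSketch {

    /** Cache-then-database lookup; all names here are hypothetical. */
    static final class DomainFilter {
        private final Set<String> cache = new HashSet<>(); // stand-in for SwarmCache
        private final Predicate<String> database;          // stand-in for the JDBC query

        DomainFilter(Predicate<String> database) {
            this.database = database;
        }

        /** Returns the URL if its host is allowed, or null to reject it. */
        String filter(String url) {
            String host = hostOf(url);
            if (host == null) {
                return null;                 // unparsable URL: reject
            }
            if (cache.contains(host)) {
                return url;                  // cache hit
            }
            if (database.test(host)) {       // cache miss: ask the database
                cache.add(host);
                return url;
            }
            return null;                     // not in the list: reject
        }

        /** Extracts the lower-cased host, or null if the URL has none. */
        private static String hostOf(String url) {
            try {
                String host = URI.create(url).getHost();
                return host == null ? null : host.toLowerCase(Locale.ROOT);
            } catch (IllegalArgumentException e) {
                return null;
            }
        }
    }

    public static void main(String[] args) {
        // The "database" here only knows one domain.
        DomainFilter filter = new DomainFilter("example.org"::equals);
        System.out.println(filter.filter("http://example.org/page"));
        System.out.println(filter.filter("http://example.com/page"));
    }
}
```

As in the pseudocode, only positive answers are cached, so a rejected domain goes back to the database on every lookup.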


-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira

[jira] Commented: (NUTCH-100) New plugin urlfilter-db

Posted by "Fuad Efendi (JIRA)" <ji...@apache.org>.

    [ http://issues.apache.org/jira/browse/NUTCH-100?page=comments#action_12367706 ] 

Fuad Efendi commented on NUTCH-100:
-----------------------------------

Please avoid this:

  public void finalize() throws Throwable {
    cleanup();
  }

- In case of an Exception, the GC ignores the Throwable and finalization of this object simply stops, so the failure goes unnoticed and cleanup() may never complete...
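
One possible alternative to relying on finalize() is an explicit close() method that the caller invokes. The sketch below assumes Java 7 or later (try-with-resources did not exist when this comment was written), and the class name DbBackedFilter is hypothetical, not taken from the attached plugin.

```java
class ExplicitCleanupDemo {

    /** Hypothetical stand-in for a filter that owns a database connection. */
    static final class DbBackedFilter implements AutoCloseable {
        private boolean closed = false;

        /** Releases resources explicitly; safe to call more than once. */
        @Override
        public void close() {
            if (!closed) {
                closed = true;
                // Real code would close the JDBC connection and the cache here.
            }
        }

        boolean isClosed() {
            return closed;
        }
    }

    public static void main(String[] args) {
        DbBackedFilter filter = new DbBackedFilter();
        // try-with-resources calls close() even if the body throws.
        try (DbBackedFilter f = filter) {
            System.out.println("closed inside block: " + f.isClosed());
        }
        System.out.println("closed after block: " + filter.isClosed());
    }
}
```

With this shape, a failure during cleanup surfaces to the caller instead of being swallowed during finalization.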




[jira] Updated: (NUTCH-100) New plugin urlfilter-db

Posted by "Gal Nitzan (JIRA)" <ji...@apache.org>.

     [ http://issues.apache.org/jira/browse/NUTCH-100?page=all ]

Gal Nitzan updated NUTCH-100:
-----------------------------

    Attachment: urlfilter-db.tar.gz

The plugin. Extract it and read the README in the myplugin folder.



Re: [jira] Updated: (NUTCH-100) New plugin urlfilter-db

Posted by Andrzej Bialecki <ab...@getopt.org>.

Doug Cutting wrote:
> Andrzej Bialecki wrote:
> 
>> 100k regexps is still a lot, so I'm not totally sure it would be much 
>> faster, but perhaps worth checking.
> 
> 
> I have worked with this type of technology before (minimized, 
> determinized FSAs, constructed from large sets of strings & expressions) 
> and it should be very fast to perform lookups, even in large, complex 
> FSAs.  Construction of the FSA can be time consuming and should probably 
> be done offline, not at fetcher startup time, so that it is only 
> performed once for a number of fetcher runs.

Guess what... this library supports (de)serialization of automata, so 
they can be compiled once, and then just stored/loaded.
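
The "compile once, then store/load" idea can be illustrated with plain Java serialization. This sketch does not use the automaton library's own store/load calls; a TreeSet stands in for the compiled structure, a byte array stands in for the file on disk, and all names are illustrative.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.UncheckedIOException;
import java.util.Arrays;
import java.util.TreeSet;

class CompileOnceSketch {

    /** Offline step: serialize the already-built lookup structure. */
    static byte[] store(TreeSet<String> compiled) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(compiled);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    /** Startup step: load the prebuilt structure instead of rebuilding it. */
    @SuppressWarnings("unchecked")
    static TreeSet<String> load(byte[] data) {
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(data))) {
            return (TreeSet<String>) in.readObject();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        TreeSet<String> compiled = new TreeSet<>(Arrays.asList("example.org", "example.com"));
        byte[] stored = store(compiled);       // done once, offline
        TreeSet<String> loaded = load(stored); // done at each fetcher start
        System.out.println(loaded.contains("example.org"));
    }
}
```

In a real setup the bytes would be written to a file that every fetcher run reads at startup.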

-- 
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com

Re: [jira] Updated: (NUTCH-100) New plugin urlfilter-db

Posted by Doug Cutting <cu...@nutch.org>.

Andrzej Bialecki wrote:
> 100k regexps is still a lot, so I'm not totally sure it would be much 
> faster, but perhaps worth checking.

I have worked with this type of technology before (minimized, 
determinized FSAs, constructed from large sets of strings & expressions) 
and it should be very fast to perform lookups, even in large, complex 
FSAs.  Construction of the FSA can be time consuming and should probably 
be done offline, not at fetcher startup time, so that it is only 
performed once for a number of fetcher runs.

Doug

Re: [jira] Updated: (NUTCH-100) New plugin urlfilter-db

Posted by Andrzej Bialecki <ab...@getopt.org>.

Gal Nitzan wrote:
> Hi Andrzej,
> 
> Yes, it seems like a good option. However, it is GPL, and I noticed in 
> one of the posts that this license is no good for apache.org :).

If you are referring to the brics automaton library, it's BSD-licensed.  I 
mentioned in one of the posts that the Innovation HTTPClient is LGPL, 
and hence not acceptable for apache.org.


-- 
Best regards,
Andrzej Bialecki     <><

Re: [jira] Updated: (NUTCH-100) New plugin urlfilter-db

Posted by Gal Nitzan <gn...@usa.net>.

Hi Andrzej,

Yes, it seems like a good option. However, it is GPL, and I noticed in 
one of the posts that this license is no good for apache.org :).

Regards,

Gal


Andrzej Bialecki wrote:
> Gal Nitzan wrote:
>> Hi,
>>
>> Well, the reason for this plugin is that I wish to crawl many sites, 
>> but they all must be in my list. If it were implemented with regular 
>> expressions, the filter would still have to loop over 100K expressions 
>> for each URL to find a match, right?
>
> No, that's the whole point - using the library I mentioned you can 
> build a _single_ finite state automaton from all expressions. No 
> looping, just traversing a tree (or whatever equivalent structure they 
> use).
>
> 100k regexps is still a lot, so I'm not totally sure it would be much 
> faster, but perhaps worth checking.
>

Re: [jira] Updated: (NUTCH-100) New plugin urlfilter-db

Posted by Andrzej Bialecki <ab...@getopt.org>.

Gal Nitzan wrote:
> Hi,
> 
> Well, the reason for this plugin is that I wish to crawl many sites, but 
> they all must be in my list. If it were implemented with regular 
> expressions, the filter would still have to loop over 100K expressions 
> for each URL to find a match, right?

No, that's the whole point - using the library I mentioned you can build 
a _single_ finite state automaton from all expressions. No looping, just 
traversing a tree (or whatever equivalent structure they use).

100k regexps is still a lot, so I'm not totally sure it would be much 
faster, but perhaps worth checking.
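
To make the "single structure, no looping over expressions" point concrete, here is a toy sketch. It is a plain character trie, i.e. a very simple deterministic automaton that accepts only literal host names; it is not the library mentioned above, which handles real regexps, and the class name is made up.

```java
import java.util.HashMap;
import java.util.Map;

class HostTrieSketch {

    /** One node per character; a path from the root spells an accepted host. */
    static final class Node {
        final Map<Character, Node> next = new HashMap<>();
        boolean accepting;
    }

    private final Node root = new Node();

    /** Merges one literal host into the shared structure. */
    void add(String host) {
        Node n = root;
        for (int i = 0; i < host.length(); i++) {
            n = n.next.computeIfAbsent(host.charAt(i), c -> new Node());
        }
        n.accepting = true;
    }

    /** One pass over the input; the work depends on its length, not on how many hosts were added. */
    boolean matches(String host) {
        Node n = root;
        for (int i = 0; i < host.length() && n != null; i++) {
            n = n.next.get(host.charAt(i));
        }
        return n != null && n.accepting;
    }

    public static void main(String[] args) {
        HostTrieSketch trie = new HostTrieSketch();
        trie.add("example.org");
        trie.add("example.com");
        System.out.println(trie.matches("example.org"));
        System.out.println(trie.matches("example.net"));
    }
}
```

A lookup walks at most one node per input character, whether the structure holds two hosts or 100k.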

-- 
Best regards,
Andrzej Bialecki     <><

Re: [jira] Updated: (NUTCH-100) New plugin urlfilter-db

Posted by Gal Nitzan <gn...@usa.net>.

Hi,

Well, the reason for this plugin is that I wish to crawl many sites, but 
they all must be in my list. If it were implemented with regular 
expressions, the filter would still have to loop over 100K expressions 
for each URL to find a match, right?

Regards,

Gal

Andrzej Bialecki wrote:
> ogjunk-nutch@yahoo.com wrote:
>> Hi Gal,
>>
>> I'm curious about the memory consumption of the cache and the speed of
>> retrieval of an item from the cache, when the cache has 100k domains in
>> it.
>
> Slightly off-topic, but I hope this is relevant to the original reason 
> for creating this plugin...
>
> There is a BSD-licensed library that implements a large subset of 
> regexps, which is based on finite automata. It is reported to be 
> scalable and very fast (benchmarks are surely impressive):
>
>     http://www.brics.dk/~amoeller/automaton/
>
> I suggest running some tests with 100k regexps to see if it survives.
>
>

Re: [jira] Updated: (NUTCH-100) New plugin urlfilter-db

Posted by Andrzej Bialecki <ab...@getopt.org>.

ogjunk-nutch@yahoo.com wrote:
> Hi Gal,
> 
> I'm curious about the memory consumption of the cache and the speed of
> retrieval of an item from the cache, when the cache has 100k domains in
> it.

Slightly off-topic, but I hope this is relevant to the original reason 
for creating this plugin...

There is a BSD-licensed library that implements a large subset of 
regexps, which is based on finite automata. It is reported to be 
scalable and very fast (benchmarks are surely impressive):

	http://www.brics.dk/~amoeller/automaton/

I suggest running some tests with 100k regexps to see if it survives.

-- 
Best regards,
Andrzej Bialecki     <><

Re: [jira] Updated: (NUTCH-100) New plugin urlfilter-db

Posted by Gal Nitzan <gn...@usa.net>.

Hi Otis,

I have only a few thousand URLs in my db at the moment. However, for 
100K it should be about 600-800KB. I do not cache the URL itself, only a 
hash string. So the next time a URL is looked up in the cache, if the 
hash exists then it is allowed.
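
As a rough illustration of where a figure of that order can come from, the sketch below keeps one 64-bit hash per domain in a sorted array, which is 8 bytes per entry and therefore 800,000 bytes for 100K domains. Both the hash function (FNV-1a) and the sorted-array layout are assumptions made for this example; the plugin's actual hash and cache layout are not described in this thread.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class HashedDomainSetSketch {

    private final long[] hashes; // 8 bytes per domain, sorted for binary search

    HashedDomainSetSketch(String... domains) {
        hashes = new long[domains.length];
        for (int i = 0; i < domains.length; i++) {
            hashes[i] = fnv1a64(domains[i]);
        }
        Arrays.sort(hashes);
    }

    /** True if the domain's hash is present; hash collisions could cause rare false positives. */
    boolean contains(String domain) {
        return Arrays.binarySearch(hashes, fnv1a64(domain)) >= 0;
    }

    /** Size of the stored hashes, ignoring the small array header. */
    long payloadBytes() {
        return (long) hashes.length * Long.BYTES;
    }

    /** 64-bit FNV-1a over the UTF-8 bytes of the string. */
    static long fnv1a64(String s) {
        long h = 0xcbf29ce484222325L;
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            h ^= (b & 0xff);
            h *= 0x100000001b3L;
        }
        return h;
    }

    public static void main(String[] args) {
        HashedDomainSetSketch set = new HashedDomainSetSketch("example.org", "example.com");
        System.out.println(set.contains("example.org"));
        System.out.println(set.payloadBytes() + " bytes of hashes");
    }
}
```

Since only hashes are kept, the original domain strings cannot be recovered from the cache, which matches the "allowed if the hash exists" behaviour described above.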

Regards,

Gal

ogjunk-nutch@yahoo.com wrote:
> Hi Gal,
>
> I'm curious about the memory consumption of the cache and the speed of
> retrieval of an item from the cache, when the cache has 100k domains in
> it.
>
> Thanks,
> Otis

Re: [jira] Updated: (NUTCH-100) New plugin urlfilter-db

Posted by og...@yahoo.com.

Hi Gal,

I'm curious about the memory consumption of the cache and the speed of
retrieval of an item from the cache, when the cache has 100k domains in
it.

Thanks,
Otis


--- Gal Nitzan <gn...@usa.net> wrote:

> Hi Michael,
> 
> At the moment I have about 3000 domains in my db. I haven't timed the 
> performance; however, even 100k domains shouldn't have an impact, 
> since each domain is fetched only once from the database into the cache. 
> There should be a small performance hit above 100k (depending on the 
> number of elements defined in the XML file).
> 
> After a few teething problems, the plugin works nicely and I do not 
> notice any impact.
> 
> Regards,
> 
> Gal

Re: [jira] Updated: (NUTCH-100) New plugin urlfilter-db

Posted by Gal Nitzan <gn...@usa.net>.

Hi Michael,

At the moment I have about 3000 domains in my db. I haven't timed the 
performance; however, even 100k domains shouldn't have an impact, 
since each domain is fetched only once from the database into the cache. 
There should be a small performance hit above 100k (depending on the 
number of elements defined in the XML file).

After a few teething problems, the plugin works nicely and I do not 
notice any impact.

Regards,

Gal


Michael Ji wrote:
> Hi,
>
> How does performance hold up if the domain list
> reaches 10,000 entries?
>
> Michael Ji

Re: [jira] Updated: (NUTCH-100) New plugin urlfilter-db

Posted by Michael Ji <fj...@yahoo.com>.

Hi,

How does performance hold up if the domain list
reaches 10,000 entries?

Michael Ji


[jira] Updated: (NUTCH-100) New plugin urlfilter-db

Posted by "Gal Nitzan (JIRA)" <ji...@apache.org>.

     [ http://issues.apache.org/jira/browse/NUTCH-100?page=all ]

Gal Nitzan updated NUTCH-100:
-----------------------------

           type: Improvement  (was: New Feature)
    Description: 
Hi,

I have written a new plugin, based on the URLFilter interface: urlfilter-db .

The purpose of this plugin is to filter domains, i.e. I would like to crawl the world but to fetch only certain domains.

The plugin uses a caching system (SwarmCache, easier to deploy than JCS) and on the back-end a database.

For each url
   filter is called
end for

filter
 get the domain name from url
  call cache.get domain
  if not in cache try the database
  if in database cache it and return it
  return null
end filter


The plugin reads the cache size, jdbc driver, connection string, table to use and domain field from nutch-site.xml


  was:
Hi,

I have written (not much) a new plugin, based on the URLFilter interface: urlfilter-db .

The purpose of this plugin is to filter domains, i.e. I would like to crawl the world but to fetch only certain domains.

The plugin uses a caching system (SwarmCache, easier to deploy than JCS) and on the back-end a database.

For each url
   filter is called
end for

filter
 get the domain name from url
  call cache.get domain
  if not in cache try the database
  if in database cache it and return it
  return null
end filter


The plugin reads the cache size, jdbc driver, connection string, table to use and domain field from nutch-site.xml


    Environment: All Nutch versions  (was: MapRed)

Fixed some issues
clean up
Added a patch for Subversion



[jira] Commented: (NUTCH-100) New plugin urlfilter-db

Posted by "Fuad Efendi (JIRA)" <ji...@apache.org>.

    [ http://issues.apache.org/jira/browse/NUTCH-100?page=comments#action_12367731 ] 

Fuad Efendi commented on NUTCH-100:
-----------------------------------

Sorry, the condition in my previous comment should use [&&] instead of [||].





[jira] Updated: (NUTCH-100) New plugin urlfilter-db

Posted by "Gal Nitzan (JIRA)" <ji...@apache.org>.

     [ http://issues.apache.org/jira/browse/NUTCH-100?page=all ]

Gal Nitzan updated NUTCH-100:
-----------------------------

    Attachment: urlfilter-db.tar.gz
                AddedDbURLFilter.patch

Fixed an issue with SwarmCache (removed loading as a daemon).
Cleaned up the code and added remarks.
Added some logging.

> New plugin urlfilter-db
> -----------------------
>
>          Key: NUTCH-100
>          URL: http://issues.apache.org/jira/browse/NUTCH-100
>      Project: Nutch
>         Type: New Feature
>   Components: fetcher
>     Versions: 0.8-dev
>  Environment: MapRed
>     Reporter: Gal Nitzan
>     Priority: Trivial
>  Attachments: AddedDbURLFilter.patch, urlfilter-db.tar.gz, urlfilter-db.tar.gz
>
> Hi,
> I have written (not much) a new plugin, based on the URLFilter interface: urlfilter-db .
> The purpose of this plugin is to filter domains, i.e. I would like to crawl the world but to fetch only certain domains.
> The plugin uses a caching system (SwarmCache, easier to deploy than JCS) and on the back-end a database.
> For each url
>    filter is called
> end for
> filter
>  get the domain name from url
>   call cache.get domain
>   if not in cache try the database
>   if in database cache it and return it
>   return null
> end filter
> The plugin reads the cache size, jdbc driver, connection string, table to use and domain field from nutch-site.xml


[jira] Commented: (NUTCH-100) New plugin urlfilter-db

Posted by "Fuad Efendi (JIRA)" <ji...@apache.org>.

    [ http://issues.apache.org/jira/browse/NUTCH-100?page=comments#action_12367721 ] 

Fuad Efendi commented on NUTCH-100:
-----------------------------------

Please add the port number:
if (u.getPort()!=-1 || u.getPort()!=80) {
   ret = ret + ":" + u.getPort();
}
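
A self-contained sketch of this check follows. With [||] the condition holds for every port value, since no port equals both -1 and 80, so the version below combines the two tests with [&&]. It also compares against URL.getDefaultPort() rather than a hard-coded 80, which is a deviation from the snippet above so that https URLs on port 443 are handled the same way. The class and method names are hypothetical.

```java
import java.net.MalformedURLException;
import java.net.URL;

/** Sketch only: builds "host" or "host:port" for use as a lookup key. */
class HostPortSketch {

    /** Appends the port only when it is explicit and not the protocol default. */
    static String hostWithPort(URL u) {
        String ret = u.getHost();
        int port = u.getPort(); // -1 when the URL has no explicit port
        if (port != -1 && port != u.getDefaultPort()) {
            ret = ret + ":" + port;
        }
        return ret;
    }

    /** Convenience overload: returns null for URLs that cannot be parsed. */
    static String hostWithPort(String url) {
        try {
            return hostWithPort(new URL(url));
        } catch (MalformedURLException e) {
            return null;
        }
    }

    public static void main(String[] args) {
        System.out.println(hostWithPort("http://example.org/index.html"));
        System.out.println(hostWithPort("http://example.org:8080/index.html"));
        System.out.println(hostWithPort("http://example.org:80/index.html"));
    }
}
```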


