Posted to user@nutch.apache.org by Eric <er...@lakemeadonline.com> on 2009/10/05 21:27:23 UTC

Targeting Specific Links for Crawling

Does anyone know if it is possible to target only certain links for
crawling dynamically during a crawl? My goal would be to write a
plugin for this functionality, but I don't know where to start.

Thanks,

EO

RE: Targeting Specific Links for Crawling

Posted by BELLINI ADAM <mb...@msn.com>.
When you start, you inject your starting points from your seed list; after that Nutch fetches URLs and skips the ones rejected by the URL filter (regular expressions). So to pick out the X URLs you need, you would have to crawl your whole site!
If you use no regular expressions at all you will of course get every link on the site (including the X links you need), but I guess you won't do that because it's a waste of time.
The only solution I can see is to set up the URL filter file properly, with the right regular expressions, for example something along the lines of the sketch below.
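
This is only a rough example of conf/regex-urlfilter.txt (or conf/crawl-urlfilter.txt if you use the one-step crawl command); the example.com patterns are placeholders for whatever your real links look like. Rules are tried top to bottom, '+' accepts, '-' rejects, and the first match wins:

  # skip image, style and archive resources
  -\.(gif|jpg|png|css|js|zip|gz)$
  # accept only the article-style links we care about (placeholder pattern)
  +^http://www\.example\.com/articles/
  # reject everything else
  -.
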
Does anybody have other ideas?

> Subject: Re: Targeting Specific Links for Crawling
> From: eric@lakemeadonline.com
> Date: Mon, 5 Oct 2009 13:07:25 -0700
> To: nutch-user@lucene.apache.org
> 
> Adam,
> 
> Yes, I have a list of strings I would look for in the link. My plan is  
> to look for X number of links on the site - First looking for the  
> links I want and if they exist, add them, if they don't  exist add X  
> links from the site. I am planning to start in the URL Filter plugin.
> 
> Eric
> 
> On Oct 5, 2009, at 12:58 PM, BELLINI ADAM wrote:
> 
> >
> >
> >
> > how to target certain links !! do you know how the links are made !?  
> > i mean their format ?
> > you can just set a regular expression to accept only those kind of  
> > links
> >
> >
> >
> >> Date: Mon, 5 Oct 2009 21:39:52 +0200
> >> From: ab@getopt.org
> >> To: nutch-user@lucene.apache.org
> >> Subject: Re: Targeting Specific Links for Crawling
> >>
> >> Eric wrote:
> >>> Does anyone know if it possible to target only certain links for
> >>> crawling dynamically during a crawl? My goal would be to write a  
> >>> plugin
> >>> for this functionality but I don't know where to start.
> >>
> >> URLFilter plugins may be what you want.
> >>
> >>
> >> -- 
> >> Best regards,
> >> Andrzej Bialecki     <><
> >>  ___. ___ ___ ___ _ _   __________________________________
> >> [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
> >> ___|||__||  \|  ||  |  Embedded Unix, System Integration
> >> http://www.sigram.com  Contact: info at sigram dot com
> >>
> 
 		 	   		  

Re: Targeting Specific Links for Crawling

Posted by Eric <er...@lakemeadonline.com>.
Adam,

Yes, I have a list of strings I would look for in the link. My plan is
to look for X number of links on the site: first look for the links I
want and add them if they exist; if they don't exist, add X links from
the site instead. I am planning to start with the URL Filter plugin.
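
Something like the sketch below could be a starting point for the "accept a link if it contains one of my strings" half of that. It is only a sketch against the URLFilter extension point; the class name and the mysite.target.substrings property are made up for the example, and the "otherwise take X links from the site" part would need state across URLs, which a plain per-URL filter doesn't give you.

package org.example.nutch;  // hypothetical package for this sketch

import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.net.URLFilter;

public class TargetLinkFilter implements URLFilter {

  private Configuration conf;
  private String[] targets;  // substrings that mark the links we want

  // Return the URL to keep it in the crawl, or null to drop it.
  public String filter(String urlString) {
    if (targets == null || targets.length == 0) {
      return urlString;  // nothing configured, accept everything
    }
    for (String t : targets) {
      if (urlString.contains(t)) {
        return urlString;
      }
    }
    return null;
  }

  public void setConf(Configuration conf) {
    this.conf = conf;
    // hypothetical property: comma-separated list of target strings,
    // set in nutch-site.xml so the filter is configurable per crawl
    this.targets = conf.getStrings("mysite.target.substrings");
  }

  public Configuration getConf() {
    return conf;
  }
}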

Eric

On Oct 5, 2009, at 12:58 PM, BELLINI ADAM wrote:

>
>
>
> how to target certain links !! do you know how the links are made !?  
> i mean their format ?
> you can just set a regular expression to accept only those kind of  
> links
>
>
>
>> Date: Mon, 5 Oct 2009 21:39:52 +0200
>> From: ab@getopt.org
>> To: nutch-user@lucene.apache.org
>> Subject: Re: Targeting Specific Links for Crawling
>>
>> Eric wrote:
>>> Does anyone know if it possible to target only certain links for
>>> crawling dynamically during a crawl? My goal would be to write a  
>>> plugin
>>> for this functionality but I don't know where to start.
>>
>> URLFilter plugins may be what you want.
>>
>>
>> -- 
>> Best regards,
>> Andrzej Bialecki     <><
>>  ___. ___ ___ ___ _ _   __________________________________
>> [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
>> ___|||__||  \|  ||  |  Embedded Unix, System Integration
>> http://www.sigram.com  Contact: info at sigram dot com
>>
> 		 	   		


RE: Targeting Specific Links for Crawling

Posted by BELLINI ADAM <mb...@msn.com>.


How would you target certain links? Do you know how the links are built, i.e. their format?
If so, you can just set a regular expression to accept only those kinds of links.



> Date: Mon, 5 Oct 2009 21:39:52 +0200
> From: ab@getopt.org
> To: nutch-user@lucene.apache.org
> Subject: Re: Targeting Specific Links for Crawling
> 
> Eric wrote:
> > Does anyone know if it possible to target only certain links for 
> > crawling dynamically during a crawl? My goal would be to write a plugin 
> > for this functionality but I don't know where to start.
> 
> URLFilter plugins may be what you want.
> 
> 
> -- 
> Best regards,
> Andrzej Bialecki     <><
>   ___. ___ ___ ___ _ _   __________________________________
> [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
> ___|||__||  \|  ||  |  Embedded Unix, System Integration
> http://www.sigram.com  Contact: info at sigram dot com
> 
 		 	   		  

Re: Targeting Specific Links for Crawling

Posted by Andrzej Bialecki <ab...@getopt.org>.
Eric wrote:
> Does anyone know if it possible to target only certain links for 
> crawling dynamically during a crawl? My goal would be to write a plugin 
> for this functionality but I don't know where to start.

URLFilter plugins may be what you want.
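
To give an idea of where to start (only a sketch; the plugin id, class and jar names below are placeholders, not anything that ships with Nutch): a URLFilter plugin is a jar plus a plugin.xml that registers your class at the org.apache.nutch.net.URLFilter extension point, and the plugin id then has to be added to the plugin.includes property in conf/nutch-site.xml (e.g. urlfilter-(regex|target)) so Nutch actually loads it.

<plugin id="urlfilter-target" name="Target Link URL Filter"
        version="1.0.0" provider-name="example.org">
  <runtime>
    <library name="urlfilter-target.jar">
      <export name="*"/>
    </library>
  </runtime>
  <requires>
    <import plugin="nutch-extensionpoints"/>
  </requires>
  <extension id="org.example.nutch.targetfilter"
             name="Target Link URL Filter"
             point="org.apache.nutch.net.URLFilter">
    <implementation id="TargetLinkFilter"
                    class="org.example.nutch.TargetLinkFilter"/>
  </extension>
</plugin>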


-- 
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com