Posted to user@nutch.apache.org by Lewis John Mcgibbney <le...@gmail.com> on 2016/05/01 21:40:13 UTC

Re: Priorize links in Fetching Step

Hi Yulio,

Marcus wrote the MimeAdaptiveFetchSchedule [0] implementation for exactly
this purpose.
You can use it as described in [1].


[0]
https://github.com/apache/nutch/blob/master/src/java/org/apache/nutch/crawl/MimeAdaptiveFetchSchedule.java
[1]
https://github.com/apache/nutch/blob/master/conf/nutch-default.xml#L487-L492
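
For reference, a minimal sketch of how this schedule might be enabled in nutch-site.xml. It assumes the standard db.fetch.schedule.class property is what selects the schedule implementation. The MIME-specific settings referenced in [1] are not reproduced here and should be taken from nutch-default.xml itself.

```xml
<!-- Assumption: placed inside <configuration> in conf/nutch-site.xml -->
<property>
  <name>db.fetch.schedule.class</name>
  <value>org.apache.nutch.crawl.MimeAdaptiveFetchSchedule</value>
</property>
```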

On Sun, May 1, 2016 at 7:43 AM, <us...@nutch.apache.org> wrote:

> From: Yulio Aleman Jimenez <yu...@uci.cu>
> To: user@nutch.apache.org
> Cc:
> Date: Fri, 29 Apr 2016 16:47:32 -0400 (CDT)
> Subject: Priorize links in Fetching Step
> Hi.
>
> I'm using Nutch 1.9 with Solr 4.10 in a local environment.
> I need a way to prioritize some links in the fetching step by
> filtering the new links discovered in previous crawls according to some
> criteria, for example the file extension of the resource. The goal is to
> fetch images, documents, etc. before HTML pages during the crawl.
>
> Is there any property in nutch-site.xml, or any plugin, that can do this?
> How can I do it?
>
> I welcome any suggestions, or source code snippets for creating a new
> plugin for Nutch.
>
> Best regards
>
> --
> Ing. Yulio Aleman Jimenez
> Dpto. Soluciones Informáticas para Internet. CIDI
> Universidad de las Ciencias Informáticas (UCI)
>
> -----------------------------------------------------------------------------------------------------------------------------------
> "Podrán morir los hombres, PERO JAMÁS SUS IDEAS"
>
>
> La UCI presente este 1ro. de Mayo en la Plaza de la Revolución
> junto a todo el pueblo.¡Por Cuba: Unidad y Compromiso!
>
>
>


-- 
*Lewis*

Re: [MASSMAIL]Re: Priorize links in Fetching Step

Posted by Yulio Aleman Jimenez <yu...@uci.cu>.

Hi Lewis. 

Thanks for your answer, it was very helpful. However, I believe this fetch schedule is used to schedule the refetching of URLs that have already been fetched and stored in the CrawlDB.

I need to prioritize the URLs discovered during crawls and stored in the LinkDB (for new crawls) by the extension of the resource, before they are fetched and stored in the CrawlDB. The MIME type of a resource is only identified after it has been fetched, which is why I believe MimeAdaptiveFetchSchedule doesn't work in this case.

Imagine this process: 
1- In the first crawl, the seed list has 10 URLs of HTML web pages.
2- In this crawl, Nutch detects 100 new URLs. Of these, 30 are images, judging by the resource extension.
3- In the second crawl, Nutch is ready to fetch the 10 seed URLs and the other 100 URLs identified in the previous crawl. But, of all these URLs, Nutch should fetch the 30 image URLs first and the rest afterwards.

With this strategy, Nutch would collect the images first and faster, while still using the HTML pages to expand across the Web.

I think I could use the ScoringFilter extension point to write a plugin that filters URLs by extension and changes their score, so that the new URLs are prioritized in subsequent crawls as I see fit.
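
A rough sketch of the extension-based boost described above, not a tested Nutch plugin. The class name ExtensionBoost, the extension list and the factor 10 are assumptions made for illustration. In a real plugin this logic would presumably be called from the generatorSortValue(Text url, CrawlDatum datum, float initSort) method of a ScoringFilter implementation, which is used to order candidate URLs when a fetch list is generated. The Nutch types are left out here so the sketch compiles on its own.

```java
import java.net.URI;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

/** Hypothetical helper: raises the sort value of URLs whose extension is on a priority list. */
class ExtensionBoost {

    // Assumed priority extensions; adjust to the resources you want fetched first.
    private static final Set<String> PRIORITY_EXTENSIONS =
            new HashSet<>(Arrays.asList("jpg", "jpeg", "png", "gif", "pdf", "doc"));

    // Assumed multiplier; any value greater than 1 moves matching URLs up the list.
    private static final float BOOST = 10.0f;

    /** Returns the lower-cased extension of the URL path, or "" if there is none. */
    static String extensionOf(String url) {
        String path;
        try {
            // getPath() drops scheme, host, query string and fragment.
            path = URI.create(url).getPath();
        } catch (IllegalArgumentException e) {
            return ""; // malformed URL: treat as having no extension
        }
        if (path == null) {
            return ""; // opaque URIs such as mailto: have no path
        }
        int slash = path.lastIndexOf('/');
        int dot = path.lastIndexOf('.');
        // The dot must be in the last path segment and must not be the final character.
        if (dot <= slash || dot == path.length() - 1) {
            return "";
        }
        return path.substring(dot + 1).toLowerCase(Locale.ROOT);
    }

    /** Multiplies the incoming sort value by BOOST for priority extensions. */
    static float boostedSortValue(String url, float initSort) {
        return PRIORITY_EXTENSIONS.contains(extensionOf(url)) ? initSort * BOOST : initSort;
    }
}
```

Because the boost is multiplicative, a URL whose incoming sort value is 0 stays at 0. An additive bonus would be one alternative if that matters for newly discovered URLs.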

Do you have any idea how I can do this? Or is there already a plugin that can do this?

Thanks a lot. 

-- 
Ing. Yulio Aleman Jimenez 
Dpto. Soluciones Informáticas para Internet. CIDI 
Universidad de las Ciencias Informáticas (UCI) 