Posted to dev@nutch.apache.org by "Philippe EUGENE (JIRA)" <ji...@apache.org> on 2006/01/13 10:19:19 UTC

[jira] Created: (NUTCH-173) PerHost Crawling Policy ( crawl.ignore.external.links )

PerHost Crawling Policy ( crawl.ignore.external.links )
-------------------------------------------------------

         Key: NUTCH-173
         URL: http://issues.apache.org/jira/browse/NUTCH-173
     Project: Nutch
        Type: New Feature
  Components: fetcher  
    Versions: 0.7.1, 0.7, 0.8-dev    
    Reporter: Philippe EUGENE
    Priority: Minor


There are two major ways of crawling with Nutch.

Intranet crawl: forbid everything, allow a few hosts.

Whole-web crawl: allow everything, forbid a few things.

I propose a third type of crawl.

Directory crawl: the purpose of this crawl is to manage a few thousand hosts without maintaining rule patterns in UrlFilterRegexp.

I made two patches, covering 0.7, 0.7.1 and 0.8-dev.

I propose a new boolean property in nutch-site.xml: crawl.ignore.external.links, false by default.
With the default value, this new feature does not modify the behavior of the Nutch crawler.

When you set this property to true, the crawler does not fetch links that point outside the source host.
So the crawl is limited to the hosts that you inject at the beginning of the crawl.
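
For example, enabling the feature would just be an override in conf/nutch-site.xml (a sketch; only the property name and default come from this proposal):

    <property>
      <name>crawl.ignore.external.links</name>
      <value>true</value>
    </property>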

I know there are proposals for new crawl policies using the CrawlDatum in the 0.8-dev branch.
This feature could be an easy way to quickly add a new crawl capability to Nutch while waiting for a better way to improve the crawl policy.

I am posting two patches.
Sorry for my very poor English.
--
Philippe



-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira


[jira] Commented: (NUTCH-173) PerHost Crawling Policy ( crawl.ignore.external.links )

Posted by "Doug Cutting (JIRA)" <ji...@apache.org>.
    [ http://issues.apache.org/jira/browse/NUTCH-173?page=comments#action_12363309 ] 

Doug Cutting commented on NUTCH-173:
------------------------------------

Couldn't you instead use a prefix-urlfilter generated from your crawl seed?
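
For example (a sketch, with hypothetical hosts), the urlfilter-prefix plugin reads its allowed prefixes from prefix-urlfilter.txt, which could be generated straight from the seed list:

    # prefix-urlfilter.txt, one allowed prefix per seed host (hosts are hypothetical)
    http://www.example-host-a.com/
    http://www.example-host-b.org/

Since the prefix filter matches against a trie of prefixes rather than evaluating regular expressions, it should scale much better than regex-urlfilter rules for large host lists.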



[jira] Closed: (NUTCH-173) PerHost Crawling Policy ( crawl.ignore.external.links )

Posted by "Andrzej Bialecki (JIRA)" <ji...@apache.org>.
     [ http://issues.apache.org/jira/browse/NUTCH-173?page=all ]

Andrzej Bialecki  closed NUTCH-173.
-----------------------------------

    Resolution: Fixed

Patch applied to trunk/. Thank you!


[jira] Updated: (NUTCH-173) PerHost Crawling Policy ( crawl.ignore.external.links )

Posted by "Stefan Neufeind (JIRA)" <ji...@apache.org>.
     [ http://issues.apache.org/jira/browse/NUTCH-173?page=all ]

Stefan Neufeind updated NUTCH-173:
----------------------------------

    Attachment: patch08-new.patch

Here is the 0.8 patch, corrected to work against the nightly from 2006-05-20.
Also, fromHost is now only generated when it is really needed, and nutch-default.xml is patched as well. By the way: where should a "crawl" property be located in the config file? In the "fetcher" section? If so, please could somebody move it or rename the property before it is included in the dev tree.

But could somebody please review it quickly? I'm not sure it's 100% correct. Still investigating on my side ...



[jira] Updated: (NUTCH-173) PerHost Crawling Policy ( crawl.ignore.external.links )

Posted by "Philippe EUGENE (JIRA)" <ji...@apache.org>.
     [ http://issues.apache.org/jira/browse/NUTCH-173?page=all ]

Philippe EUGENE updated NUTCH-173:
----------------------------------

    Attachment: patch.txt

Patch for the 0.7 and 0.7.1 versions.



[jira] Updated: (NUTCH-173) PerHost Crawling Policy ( crawl.ignore.external.links )

Posted by "Philippe EUGENE (JIRA)" <ji...@apache.org>.
     [ http://issues.apache.org/jira/browse/NUTCH-173?page=all ]

Philippe EUGENE updated NUTCH-173:
----------------------------------

    Attachment: patch08.txt

Patch for the 0.8-dev version.



[jira] Commented: (NUTCH-173) PerHost Crawling Policy ( crawl.ignore.external.links )

Posted by "Doug Cutting (JIRA)" <ji...@apache.org>.
    [ http://issues.apache.org/jira/browse/NUTCH-173?page=comments#action_12375421 ] 

Doug Cutting commented on NUTCH-173:
------------------------------------

+1, with a few modifications.

Can you please re-generate this against the current sources?  This patch does not apply for me.

Also, the fromHost should only be computed if crawl.ignore.external.links is true.
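
A minimal sketch of that lazy computation (class, method, and variable names here are illustrative, not taken from the patch):

    import java.net.MalformedURLException;
    import java.net.URL;

    public class OutlinkPolicy {
      /** Accept an outlink unless crawl.ignore.external.links is true
       *  and the link leaves the source host. */
      public static boolean accept(String fromUrl, String toUrl,
                                   boolean ignoreExternalLinks)
          throws MalformedURLException {
        if (!ignoreExternalLinks) {
          return true;  // feature off: no URL parsing, old behavior preserved
        }
        // fromHost is computed only when the feature is enabled
        String fromHost = new URL(fromUrl).getHost().toLowerCase();
        String toHost = new URL(toUrl).getHost().toLowerCase();
        return fromHost.equals(toHost);  // keep only same-host outlinks
      }
    }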

Finally, please add an entry to conf/nutch-default.xml for the new parameter in your patch.
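
Something along these lines would do (a sketch; the description wording is illustrative):

    <!-- conf/nutch-default.xml -->
    <property>
      <name>crawl.ignore.external.links</name>
      <value>false</value>
      <description>If true, outlinks leading to a different host are
      ignored, so the crawl stays on the hosts injected as seeds.
      The default value, false, leaves the crawler behavior unchanged.
      </description>
    </property>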

Thanks!



[jira] Commented: (NUTCH-173) PerHost Crawling Policy ( crawl.ignore.external.links )

Posted by "Stefan Neufeind (JIRA)" <ji...@apache.org>.
    [ http://issues.apache.org/jira/browse/NUTCH-173?page=comments#action_12412530 ] 

Stefan Neufeind commented on NUTCH-173:
---------------------------------------

Applies fine and works for me on 0.7.2.



[jira] Commented: (NUTCH-173) PerHost Crawling Policy ( crawl.ignore.external.links )

Posted by "Philippe EUGENE (JIRA)" <ji...@apache.org>.
    [ http://issues.apache.org/jira/browse/NUTCH-173?page=comments#action_12363807 ] 

Philippe EUGENE commented on NUTCH-173:
---------------------------------------

I have more than 5,000 hosts in my directory, and I'm not sure about crawl performance with more than 5,000 rules.
It's easier for me to just manage a boolean value in the Nutch configuration.
I know this is not the natural way of crawling with Nutch, but it could be interesting for some Nutch users.
The most important problem: scoring from external links is affected by this patch.




[jira] Commented: (NUTCH-173) PerHost Crawling Policy ( crawl.ignore.external.links )

Posted by "Christophe Noel (JIRA)" <ji...@apache.org>.
    [ http://issues.apache.org/jira/browse/NUTCH-173?page=comments#action_12375300 ] 

Christophe Noel commented on NUTCH-173:
---------------------------------------

Tens of Nutch users are using this precious patch.

Most Nutch users are not building a whole-web search engine (too much hardware is needed) but want to develop dedicated search engines.

We sometimes crawl 1,000 and sometimes 25,000 web servers, and 25,000 entries in the prefix-urlfilter really slow down the crawl.

This patch is NEEDED!

Christophe Noël
CETIC
Belgium
