Posted to user@nutch.apache.org by Stefan Groschupf <sg...@media-style.com> on 2005/03/03 11:06:55 UTC

Re: [Nutch-general] combining crawl directories

Keith,

Just copy all of your segment folders into one folder.
Then create a new, empty webdb. Next, simply update the new
webdb with all of your existing segments.
The result is a merge of your two crawl directories.
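
If it helps, here is a rough sketch of those three steps in Python. It
assumes your layout below (crawl_directory1 and crawl_directory2, each
with a segments folder), that bin/nutch is on your path, and the
0.x-era tool names (admin ... -create, updatedb); the crawl_merged
name is just a placeholder.

import os
import shutil
import subprocess

crawls = ["crawl_directory1", "crawl_directory2"]  # your existing crawl dirs
merged = "crawl_merged"                            # new, combined crawl dir

# 1. copy every segment folder into one common segments directory
os.makedirs(os.path.join(merged, "segments"), exist_ok=True)
for crawl in crawls:
    seg_root = os.path.join(crawl, "segments")
    for seg in sorted(os.listdir(seg_root)):
        # segment names are timestamps, so they should not collide
        shutil.copytree(os.path.join(seg_root, seg),
                        os.path.join(merged, "segments", seg))

# 2. create a new, empty webdb for the merged crawl
subprocess.check_call(["bin/nutch", "admin",
                       os.path.join(merged, "db"), "-create"])

# 3. update the new webdb from every copied segment
for seg in sorted(os.listdir(os.path.join(merged, "segments"))):
    subprocess.check_call(["bin/nutch", "updatedb",
                           os.path.join(merged, "db"),
                           os.path.join(merged, "segments", seg)])

Afterwards you can run the db analysis on the merged webdb
(bin/nutch analyze) and generate/fetch new segments from it as usual.
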
BTW, there was some discussion about how to do location-limited
crawling on the developer mailing list.

HTH
Stefan

Am 03.03.2005 um 06:18 schrieb Keith Campbell:

> Hi,
>
> I'm a new user of Nutch running a beta Boston-based search engine 
> here: http://www.beantownsearch.com/. I've been evaluating Nutch and 
> testing various "crawl strategies".
>
> Because of current bandwidth/hardware limitations, I've had to devise 
> a methodical and selective "crawl strategy" to index the most relevant 
> (and comprehensive) Boston-specific websites. (So far so good. My 
> results already appear to be on par with Google's results). In order 
> to continue this strategy (and to see if it is possible to offer a 
> better local search than Google's) I need to learn how to combine 
> multiple crawl directories (each many Gigabytes).
>
> crawl_directory1
> 	- db
> 	- segments
> crawl_directory2
> 	- db
> 	- segments
> etc ...
>
> The first directory was seeded with a very large number of 
> Boston-specific URLs which were then used for "intranet crawling" 
> followed by db analysis and numerous rounds of "whole web crawling". 
> I've created additional crawl directories by "intranet crawling" other 
> large collections of distinct Boston URLs. I've now reached the stage 
> where I need to be able to combine these crawl directories into one 
> crawl directory which can then be analyzed and used for continued 
> "whole web crawling." This step is important because it's the only 
> method I could think of to keep my "whole web crawling" on target 
> (i.e. in the Boston area).
>
> So my question is: How do I combine multiple crawl directories into 
> one directory which can be used for additional "whole web crawling"?
>
> Thanks in advance for any help,
> Keith Campbell
>
>
>
-----------information technology-------------------
company:     http://www.media-style.com
forum:           http://www.text-mining.org
blog:	             http://www.find23.net


location identification (was: combining crawl directories)

Posted by Stefan Groschupf <sg...@media-style.com>.
Keith,

Maybe I was not clear; we may have a different use case.
The goal of our research project was to identify the location of the 
content publisher.
Since the publisher and the domain owner are usually identical, our 
results were good enough.
Of course we cache whois results, since the lookups are a bottleneck 
and one is not allowed to flood (DoS) the whois servers.
Furthermore, we do named-entity extraction and then try to classify 
entities such as addresses based on their context information, in 
order to assign a location to a domain.
However, the first method provides better quality.
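
Just to illustrate the caching part (this is not our actual code, only
a minimal sketch that assumes the whois command line tool is
installed):

import os
import subprocess

CACHE_DIR = "whois-cache"

def cached_whois(domain):
    """Return the raw whois record for a domain, querying the server only once."""
    os.makedirs(CACHE_DIR, exist_ok=True)
    cache_file = os.path.join(CACHE_DIR, domain + ".txt")
    if os.path.exists(cache_file):
        # already asked once: reuse the stored answer
        with open(cache_file) as f:
            return f.read()
    # first time: ask the whois server and keep the answer on disk
    record = subprocess.run(["whois", domain],
                            capture_output=True, text=True).stdout
    with open(cache_file, "w") as f:
        f.write(record)
    return record

Extracting the registrant's address from the record is
registry-specific, so it is left out here.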

Anyway, in case you want to get the location of a server, there is 
another nice trick.
There are some free and commercial datasets that map IP ranges to 
locations.
Do a traceroute to the target server and then try to look up one of 
the last hops in the IP-to-location database. :-)
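
A very rough sketch of that trick, assuming the system traceroute
command and some IP-range-to-location table; the ip-ranges.csv file
with start,end,location columns is only a placeholder for whatever
dataset you actually use:

import csv
import ipaddress
import re
import subprocess

def last_hops(host, n=3):
    """Return the last n responding hop IPs of a traceroute to host."""
    out = subprocess.run(["traceroute", "-n", host],
                         capture_output=True, text=True).stdout
    return re.findall(r"^\s*\d+\s+(\d+\.\d+\.\d+\.\d+)", out, re.MULTILINE)[-n:]

def load_ranges(path="ip-ranges.csv"):
    """Load (start, end, location) rows from the placeholder range file."""
    with open(path) as f:
        return [(int(ipaddress.ip_address(start)),
                 int(ipaddress.ip_address(end)), location)
                for start, end, location in csv.reader(f)]

def locate(ip, ranges):
    """Return the location of the range containing ip, or None."""
    value = int(ipaddress.ip_address(ip))
    for start, end, location in ranges:
        if start <= value <= end:
            return location
    return None

ranges = load_ranges()
for hop in last_hops("www.beantownsearch.com"):
    print(hop, locate(hop, ranges))
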
Please note that this method is covered by a company's patent.
I heard Google used this method as well, but a court decided that the 
mechanism is patented by that company; I don't remember the name.

HTH
Stefan






Am 10.03.2005 um 20:05 schrieb Keith Campbell:

> Stefan,
>
> I finally got around to reading the discussion on location limited 
> crawling to which you referred me. The approach you discussed appears 
> to rely on WHOIS lookups to determine the geographic location of web 
> servers to filter URLs. It's an interesting idea (and maybe the only 
> way to automate a geographic crawl), but I'm curious about the kinds of 
> results you get with this. You are assuming that most websites use 
> local hosting services and that these websites, when locally hosted, 
> have content about that locality (rather than some other locality or 
> non-local-specific content).
>
> Just curious, from your experience, is there a strong correlation 
> between the geographic location of web hosts and geographic-specific 
> content on those webservers?
>
> How much "noise" do you get from doing this kind of IP-based crawl?
>
> Keith
>
---------------------------------------------------------------
company:		http://www.media-style.com
forum:		http://www.text-mining.org
blog:			http://www.find23.net


Re: [Nutch-general] combining crawl directories

Posted by Keith Campbell <ke...@mac.com>.
Stefan,

I finally got around to reading the discussion on location limited 
crawling to which you referred me. The approach you discussed appears 
to rely on WHOIS lookups to determine the geographic location of web 
servers to filter URLs. It's an interesting idea (and maybe the only 
way to automate a geographic crawl), but I'm curious about the kinds of 
results you get with this. You are assuming that most websites use 
local hosting services and that these websites, when locally hosted, 
have content about that locality (rather than some other locality or 
non-local-specific content).

Just curious, from your experience, is there a strong correlation 
between the geographic location of web hosts and geographic-specific 
content on those webservers?

How much "noise" do you get from doing this kind of IP-based crawl?

Keith
