Posted to agent@nutch.apache.org by Kirk Gillock <pk...@isara.org> on 2009/12/05 15:29:20 UTC

HTTP Header problem

Hi fellow Nutch users.

Long time crawler, first time poster. :-)

We're 23m pages into a 100m page crawl, and our preliminary tests show 
that a lot of pages contain our agent name, description, etc., in 
their page content. That is, on sites that run a script which displays 
HTTP request headers (typically to show browser information), the Nutch 
crawler ends up storing its own header information as part of that 
page's content. So when we search our index for "Isara" (our agent name) 
we get thousands of results, and they all contain "Isara/Isara-1.0 (A 
non-profit search engine benefiting charity.; http://www.isara.org; 
e-mail@removed.org", which is built from the values we set in our 
nutch-default.xml file: http.agent.name, http.agent.description, 
http.agent.url, http.agent.email, and http.agent.version.
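
To make the shape of that string concrete, here is a minimal Java sketch of 
how the five http.agent.* values could be joined into the agent string quoted 
above. It is an illustration only: the class AgentString and its build method 
are hypothetical and are not Nutch's actual code, the version value 
"Isara-1.0" is inferred from the quoted string, and the closing parenthesis 
is an assumption, since the quoted string is cut off before it.

```java
// Illustrative sketch only: a hypothetical helper, not Nutch's actual code.
final class AgentString {

    /** Joins the values into "name/version (description; url; email)". */
    static String build(String name, String version,
                        String description, String url, String email) {
        StringBuilder sb = new StringBuilder(name);
        if (version != null && !version.isEmpty()) {
            sb.append('/').append(version);
        }
        // Collect the optional parts, separated by "; ".
        StringBuilder details = new StringBuilder();
        for (String part : new String[] {description, url, email}) {
            if (part != null && !part.isEmpty()) {
                if (details.length() > 0) {
                    details.append("; ");
                }
                details.append(part);
            }
        }
        // Assumption: the details are wrapped in a closed pair of parentheses.
        if (details.length() > 0) {
            sb.append(" (").append(details).append(')');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(build(
                "Isara",                                          // http.agent.name
                "Isara-1.0",                                      // http.agent.version (inferred)
                "A non-profit search engine benefiting charity.", // http.agent.description
                "http://www.isara.org",                           // http.agent.url
                "e-mail@removed.org"));                           // http.agent.email
    }
}
```

Any page that prints this request header back will carry every one of these 
values in its text, which is why a search for the agent name alone matches.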

I've searched around and haven't found any information on how to stop 
this from happening. Is there a solution and, if so, would we need to 
recrawl all those pages, or can we filter the current database? Any 
suggestions would be greatly appreciated.

Thank you for developing such an important open-source application,
Kirk Gillock
Isara Charity Foundation
Nong Khai, Thailand
http://www.isara.org

Re: HTTP Header problem

Posted by Kirk Gillock <pk...@isara.org>.

Thank you for the quick reply, Dennis. It was worth a shot. :-)

People don't typically search for our own name on our own site but, in 
case they do, we wanted the results to be as clean as possible. For our 
next crawls we'll change the agent name and version to something else.

Thanks again,
Kirk


Dennis Kubes wrote:
> There isn't a way to stop this from happening really except to change 
> the agent name in the Nutch configuration.  When an http request is 
> made, the agent name is sent as a header.  There are many pages as you 
> say that simply have logs of different user-agents hitting their sites 
> or have a script to spit back the user agent when a crawler is detected.
>
> Dennis

Re: HTTP Header problem

Posted by Dennis Kubes <ku...@apache.org>.

There isn't really a way to stop this from happening other than changing 
the agent name in the Nutch configuration.  When an HTTP request is 
made, the agent name is sent in the User-Agent header.  As you say, many 
pages simply list logs of the different user agents hitting their sites, 
or run a script that echoes the user agent back when a crawler is detected.
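
The mechanism can be illustrated with a small, self-contained Java sketch. 
Nothing here is Nutch code: HeaderEchoDemo, renderPage and extractText are 
hypothetical stand-ins for a site script that prints the visitor's User-Agent 
and for a crawler's text extraction, and the agent value is shortened to 
"Isara/Isara-1.0" for readability.

```java
import java.util.Map;

// Illustrative sketch only: simulates the echo without any network access.
final class HeaderEchoDemo {

    /** Stand-in for a site script that prints the visitor's User-Agent. */
    static String renderPage(Map<String, String> requestHeaders) {
        String agent = requestHeaders.getOrDefault("User-Agent", "unknown");
        return "<html><body><p>Your browser: " + agent + "</p></body></html>";
    }

    /** Stand-in for text extraction: crudely replaces each tag with a space. */
    static String extractText(String html) {
        return html.replaceAll("<[^>]*>", " ").trim();
    }

    public static void main(String[] args) {
        // The crawler identifies itself in the request...
        Map<String, String> headers = Map.of("User-Agent", "Isara/Isara-1.0");
        // ...the page echoes that header back in its body...
        String html = renderPage(headers);
        // ...and the extracted text now contains the crawler's own name.
        String text = extractText(html);
        System.out.println(text);
        System.out.println(text.contains("Isara"));
    }
}
```

Because the echoed text is generated per request, it follows whatever agent 
name the crawler sends, so changing the configured name changes what these 
pages contain on the next crawl.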

Dennis