Posted to dev@nutch.apache.org by Stefan Groschupf <sg...@media-style.com> on 2005/12/19 19:37:47 UTC

problems http-client

Hi there,

is there someone out there that can confirm a problem we discovered?

We were wondering why not all pages of a generated segment were
fetched. The strangest thing was that the sum of errors and
successfully fetched pages was never equal to the topN we defined when
generating the segment.
At first we found a problem with the segment size, but I cannot
reproduce it anymore with the latest trunk code. :-/
Very strange, since I don't think anything changed, but I was able to
reproduce on 2 different map reduce installations that the size of the
segment was only around 50% of the defined size (topN).

Anyway, today we noticed that when fetching with http-client the sum of
errors and fetched pages is much less than the size defined when
generating the segment.
Changing to protocol-http solves the problem.
Has anyone else noticed this behavior?

Thanks for comments.
Stefan






Re: problems http-client

Posted by Andrzej Bialecki <ab...@getopt.org>.
Stefan Groschupf wrote:

> OK I will do that tomorrow!
> However, if it is known to be buggy, we probably should not ship it
> as the default http protocol plugin, as it is today.
> Newbies checking out Nutch will use the version that does not fetch
> all pages, since most people start with the standard configuration.


Well, it's a question of what beginners need: stability or features.
protocol-httpclient handles many web features better out of the box,
such as cookies, authentication, proxies, https and redirects. I also
think that some result codes are handled better.

-- 
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: problems http-client

Posted by Doug Cutting <cu...@nutch.org>.
Andrzej Bialecki wrote:
> Hmm... I'm not saying it's flawless, there were surely some mysterious 
> things going on with it. That large crawl you mention, was it with the 
> (recently updated in Nutch) release 3.0? What were the issues?

No, it was in early December, with the previous version.  I don't recall 
the details, but it seemed slower, had a higher error rate, and seemed 
to result in more hung thread incidents.

> The main advantage of protocol-http is that it's so simple that few 
> things can go wrong, but this also means it's relatively 
> unsophisticated, and adding more advanced features could mean a lot of 
> work. Namely, adding support for https, cookies and authentication.

These are all good reasons to use protocol-httpclient.  But if you don't 
need any of those features, protocol-http seems to presently work better.

Perhaps we should get more feedback on the 3.0 version before we make a 
decision?

Doug

Re: problems http-client

Posted by Andrzej Bialecki <ab...@getopt.org>.
Jérôme Charron wrote:

>> A related issue is that these two plugins replicate a lot of code.  At
>> some point we should try to fix that.  See:
>>
>> http://www.nabble.com/protocol-http-versus-protocol-httpclient-t521282.html
>
> I have begun working on this. Nobody else? Can I go on?

Please do go on!




Re: problems http-client

Posted by Jérôme Charron <je...@gmail.com>.
> > A related issue is that these two plugins replicate a lot of code.  At
> > some point we should try to fix that.  See:
> >
> > http://www.nabble.com/protocol-http-versus-protocol-httpclient-t521282.html

I have begun working on this. Nobody else? Can I go on?

Jérôme

--
http://motrech.free.fr/
http://www.frutch.org/

Re: problems http-client

Posted by Andrzej Bialecki <ab...@getopt.org>.
Doug Cutting wrote:

> Stefan Groschupf wrote:
>
>> However, if it is known to be buggy, we probably should not ship it
>> as the default http protocol plugin, as it is today.
>
>
> +1
>
> I have found protocol-http to be more reliable for large crawls than 
> protocol-httpclient and would be in favor of switching the default 
> back to protocol-http.  When folks need advanced features then they 
> can switch to protocol-httpclient.  Thoughts?
>

Hmm... I'm not saying it's flawless, there were surely some mysterious 
things going on with it. That large crawl you mention, was it with the 
(recently updated in Nutch) release 3.0? What were the issues?

The main advantage of protocol-http is that it's so simple that few 
things can go wrong, but this also means it's relatively 
unsophisticated, and adding more advanced features could mean a lot of 
work. Namely, adding support for https, cookies and authentication.

> A related issue is that these two plugins replicate a lot of code.  At 
> some point we should try to fix that.  See:
>
> http://www.nabble.com/protocol-http-versus-protocol-httpclient-t521282.html 
>


Yes.




Re: problems http-client

Posted by Doug Cutting <cu...@nutch.org>.
Stefan Groschupf wrote:
> However, if it is known to be buggy, we probably should not ship it
> as the default http protocol plugin, as it is today.

+1

I have found protocol-http to be more reliable for large crawls than 
protocol-httpclient and would be in favor of switching the default back 
to protocol-http.  When folks need advanced features then they can 
switch to protocol-httpclient.  Thoughts?
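
As a rough sketch of what switching the default involves: assuming the
active plugins are selected by a regular expression in the
plugin.includes property, and assuming that expression must match a
plugin's whole id (both worth checking against the trunk code), the two
settings behave as below. The class name and the pattern strings are
shortened example values made up for illustration, not the shipped
defaults.

```java
import java.util.regex.Pattern;

// Illustration only: the pattern strings are example values, and treating
// plugin.includes as a regex matched against the whole plugin id is an assumption.
final class PluginIncludes {
    /** True when the given plugin id is selected by the includes expression. */
    static boolean enabled(String includesRegex, String pluginId) {
        return Pattern.compile(includesRegex).matcher(pluginId).matches();
    }

    public static void main(String[] args) {
        String withHttpClient = "protocol-httpclient|urlfilter-regex|parse-(text|html)|index-basic";
        String withHttp = "protocol-http|urlfilter-regex|parse-(text|html)|index-basic";
        // Full-id matching keeps the two protocol plugins from enabling each other.
        System.out.println(enabled(withHttpClient, "protocol-http"));
        System.out.println(enabled(withHttp, "protocol-http"));
        System.out.println(enabled(withHttp, "protocol-httpclient"));
    }
}
```

With whole-id matching, listing protocol-http does not also pull in
protocol-httpclient, even though one name is a prefix of the other.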

A related issue is that these two plugins replicate a lot of code.  At 
some point we should try to fix that.  See:

http://www.nabble.com/protocol-http-versus-protocol-httpclient-t521282.html

Doug

Re: problems http-client

Posted by Stefan Groschupf <sg...@media-style.com>.
OK I will do that tomorrow!
However, if it is known to be buggy, we probably should not ship it
as the default http protocol plugin, as it is today.
Newbies checking out Nutch will use the version that does not fetch
all pages, since most people start with the standard configuration.

Am 19.12.2005 um 19:47 schrieb Andrzej Bialecki:

> Stefan Groschupf wrote:
>
>> Anyway, today we noticed that when fetching with http-client the
>> sum of errors and fetched pages is much less than the size defined
>> when generating the segment.
>> Changing to protocol-http solves the problem.
>> Has anyone else noticed this behavior?
>
>
> I haven't, but this plugin is known to have some issues... Could
> you add some log messages here and there to confirm this, like
> counting the number of invocations of getProtocolOutput in
> protocol-httpclient vs. the number of calls to FetcherThread.output().
> This could be a bug somewhere in the redirect handling code.

---------------------------------------------------------------
company:        http://www.media-style.com
forum:        http://www.text-mining.org
blog:            http://www.find23.net



Re: problems http-client

Posted by Andrzej Bialecki <ab...@getopt.org>.
Stefan Groschupf wrote:

> Anyway, today we noticed that when fetching with http-client the sum of
> errors and fetched pages is much less than the size defined when
> generating the segment.
> Changing to protocol-http solves the problem.
> Has anyone else noticed this behavior?


I haven't, but this plugin is known to have some issues... Could you add 
some log messages here and there to confirm this, like counting the 
number of invocations of getProtocolOutput in protocol-httpclient vs. 
the number of calls to FetcherThread.output(). This could be a bug 
somewhere in the redirect handling code.
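
Something along these lines might do as a starting point. It is only a
sketch: the CallCounters class, its field names and the places where the
counters would be bumped are my assumptions, not existing Nutch code,
and AtomicLong needs Java 5 or later.

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical debugging aid; the counter names and hook points are assumptions,
// not existing Nutch code.
final class CallCounters {
    // Increment at the top of getProtocolOutput in protocol-httpclient.
    static final AtomicLong PROTOCOL_CALLS = new AtomicLong();
    // Increment at the top of FetcherThread.output().
    static final AtomicLong OUTPUT_CALLS = new AtomicLong();

    /** Number of protocol invocations that never produced an output record. */
    static long gap() {
        return PROTOCOL_CALLS.get() - OUTPUT_CALLS.get();
    }

    /** One line suitable for a log message at the end of a fetch run. */
    static String summary() {
        return "getProtocolOutput=" + PROTOCOL_CALLS.get()
             + " output=" + OUTPUT_CALLS.get()
             + " gap=" + gap();
    }

    public static void main(String[] args) {
        // Simulate three fetch attempts of which only two reach output().
        for (int i = 0; i < 3; i++) PROTOCOL_CALLS.incrementAndGet();
        for (int i = 0; i < 2; i++) OUTPUT_CALLS.incrementAndGet();
        System.out.println(summary());
    }
}
```

A non-zero gap at the end of a run would point at pages that entered
the protocol plugin but were never written out.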




Re: problems http-client

Posted by Ken Krugler <kk...@transpac.com>.
>I have started to see this problem recently: topN=200000 per crawl, but
>fetched pages = 150000 - 170000, while error pages = 2000 - 5000.  More than
>25000 pages are missing.  This is reproducible with Nutch 0.7.1; both
>protocol-http and protocol-httpclient are included.

Depending on how you have Nutch configured, redirects can result in 
pages getting skipped, if the redirect count exceeds the 
(configurable) limit.
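
To illustrate the mechanism, here is a minimal sketch of a redirect
loop with a limit. It is not the Nutch fetcher code: the RedirectLimit
class, the resolve method and the in-memory redirect table are all made
up for the example.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration, not Nutch code: follows redirects up to a limit.
final class RedirectLimit {
    /**
     * Returns the final URL, or null when the chain is longer than maxRedirects.
     * A caller that neither stores a page nor logs an error for null
     * would make that page vanish from both totals.
     */
    static String resolve(String url, Map<String, String> redirects, int maxRedirects) {
        String current = url;
        int followed = 0;
        while (redirects.containsKey(current)) {
            if (followed == maxRedirects) {
                return null; // limit exceeded: the page is skipped
            }
            current = redirects.get(current);
            followed++;
        }
        return current;
    }

    public static void main(String[] args) {
        Map<String, String> redirects = new HashMap<>();
        redirects.put("http://example.com/a", "http://example.com/b");
        redirects.put("http://example.com/b", "http://example.com/c");
        // A chain of two redirects fits within a limit of 3.
        System.out.println(resolve("http://example.com/a", redirects, 3));
        // The same chain exceeds a limit of 1, so the page is dropped.
        System.out.println(resolve("http://example.com/a", redirects, 1));
    }
}
```

If the skipped case is counted as neither a success nor an error, the
two totals will fall short of topN by the number of such pages.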

I don't know whether a "not found" HTTP status results in a skipped
page (one that is not reported as an error).

>I also see lots of "Response content length is not known" in the log, but
>I can't find where it comes from.  Which class logs this message?

This is coming from the Jakarta commons httpclient code:

/src/java/org/apache/commons/httpclient/HttpMethodBase.java

-- Ken


-- 
Ken Krugler
Krugle, Inc.
+1 530-470-9200

Re: problems http-client

Posted by AJ Chen <ca...@gmail.com>.
I have started to see this problem recently: topN=200000 per crawl, but
fetched pages = 150000 - 170000, while error pages = 2000 - 5000.  More than
25000 pages are missing.  This is reproducible with Nutch 0.7.1; both
protocol-http and protocol-httpclient are included.

I also see lots of "Response content length is not known" in the log, but
I can't find where it comes from.  Which class logs this message?
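
For reference, the size of the gap follows directly from the figures
above. A small sketch of the arithmetic (the FetchAccounting class and
its method are made up for illustration, nothing here is Nutch code):

```java
// Hypothetical helper, not part of Nutch: checks whether fetcher totals add up to topN.
final class FetchAccounting {
    /** Pages generated into the segment but neither fetched nor reported as errors. */
    static long unaccounted(long topN, long fetched, long errors) {
        return topN - (fetched + errors);
    }

    public static void main(String[] args) {
        long topN = 200_000;
        // Smallest gap: the most pages fetched and the most errors reported.
        System.out.println(unaccounted(topN, 170_000, 5_000));
        // Largest gap: the fewest pages fetched and the fewest errors reported.
        System.out.println(unaccounted(topN, 150_000, 2_000));
    }
}
```

So between 25000 and 48000 pages per crawl are neither fetched nor
counted as errors.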

AJ

Re: problems http-client

Posted by Michael <mi...@gameservice.ru>.
The same problem on FreeBSD 6.0 + jdk1.4.2
I think it was also reported some time ago by Rod Taylor.

Switch to protocol-http.

Michael