Posted to user@nutch.apache.org by Iain Lopata <il...@hotmail.com> on 2013/12/08 19:06:09 UTC

Unsuccessful fetch/parse of large page with many outlinks

I am running Nutch 1.6 on Ubuntu Server.

 

I am experiencing a problem with one particular webpage.

 

If I use parsechecker against the problem url the output shows (host name
changed to example.com):

 

================================================================
fetching: http://www.example.com/index.cfm?pageID=12
text/html
parsing: http://www.example.com/index.cfm?pageID=12
contentType: text/html
signature: a9c640626fcad48caaf3ad5f94bea446
---------
Url
---------------
http://www.example.com/index.cfm?pageID=12
---------
ParseData
---------
Version: 5
Status: success(1,0)
Title:
Outlinks: 0
Content Metadata: Date=Sun, 08 Dec 2013 17:32:33 GMT
  Set-Cookie=CFTOKEN=96208061;path=/ Content-Type=text/html; charset=UTF-8
  Connection=close X-Powered-By=ASP.NET Server=Microsoft-IIS/6.0
Parse Metadata: CharEncodingForConversion=utf-8 OriginalCharEncoding=utf-8
========================================================================
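
(For reference, parsechecker output like the above comes from an invocation
roughly like the following; the exact options may differ:)

bin/nutch parsechecker "http://www.example.com/index.cfm?pageID=12"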

 

However, this page has 3775 outlinks.

 

If I run a crawl with this page as a seed, the log file shows that the file
was fetched successfully, but debug code that I have inserted in a custom
filter shows that the file that was retrieved is only 198 bytes long.  For
some reason the file would seem to be truncated or otherwise corrupted.

 

I can retrieve the file with wget and can see that the file is 597KB.

 

I copied the file that I retrieved with wget to another web server and
attempted to crawl it from that site and it works fine, retrieving all 597KB
and parsing it successfully.  This would suggest that my current
configuration does not have a problem processing this large file.

 

I have checked the robots.txt file on the original host and it allows
retrieval of this web page.

 

Other relevant configuration settings may be:

 

<property>
  <name>http.content.limit</name>
  <value>-1</value>
</property>

<property>
  <name>http.timeout</name>
  <value>60000</value>
  <description></description>
</property>

 

Any ideas on what to check next?

 


RE: Unsuccessful fetch/parse of large page with many outlinks

Posted by Iain Lopata <il...@hotmail.com>.
The debug code did not interfere with the parser. How do I know?  Because:

a) The same debug code runs when I retrieve the page from a different host.
b) If I remove the filter with the debug code from the configuration, the
page is still not parsed correctly.

I have run a segment dump and the relevant portion of the dump output for
this url shows simply:

====================================

ParseText::

Content::
Version: -1

=====================================
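
(For reference, a segment dump like the one above can be produced with the
readseg tool referenced in the quoted reply below; roughly as follows, with
the segment path as a placeholder:)

bin/nutch readseg -dump crawl/segments/<segment> readseg_dump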

The configuration between the two crawls was identical.  It is being run
from the same Nutch machine and config.  Only the host from which the page
is retrieved is different, i.e. if I retrieve it from
www.example.com/page.html it fails, but if I copy the file to
www.myexample.com/page.html then it works correctly.

I have:

log4j.logger.org.apache.nutch.fetcher.Fetcher=DEBUG,cmdstdout

I do not see an entry in the log4j config for http-protocol, but I do have:

<property>
  <name>http.verbose</name>
  <value>true</value>
  <description>If true, HTTP will log more verbosely.</description>
</property>
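
(For completeness: the extra DEBUG logging suggested in the quoted reply
below would presumably be enabled with lines like these in log4j.properties;
the logger names are my guess at the protocol-http plugin packages:)

log4j.logger.org.apache.nutch.protocol.http=DEBUG,cmdstdout
log4j.logger.org.apache.nutch.protocol.http.api=DEBUG,cmdstdout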

I have now discovered that the same problem appears to be true for several
(maybe even all) pages on this host. Looks like a connection problem rather
than a data problem to me.


-----Original Message-----
From: Tejas Patil [mailto:tejas.patil.cs@gmail.com] 
Sent: Sunday, December 08, 2013 2:29 PM
To: user@nutch.apache.org
Subject: Re: Unsuccessful fetch/parse of large page with many outlinks

> debug code that I have inserted in a custom filter shows that the file
> that was retrieved is only 198 bytes long.
I am assuming that this code did not hinder the crawler. A better way to see
the content would be to take a segment dump [0] and then analyse it.
Also, turn on DEBUG mode of log4j for the http protocol classes and the
fetcher class.

> attempted to crawl it from that site and it works fine, retrieving all
> 597KB and parsing it successfully.
You mean that you ran a nutch crawl with the problematic url as a seed and
used the EXACT same config on both machines: one machine gave perfect
content and the other did not. Note that using the EXACT same config over
these 2 runs is important.

> the page has about 350 characters of LineFeeds, CarriageReturns and
> spaces
No way. The HTTP request gets a byte stream as a response. Also, had LF or
CR characters been the problem, they would hit Nutch irrespective of which
machine you run it from...but that's not what your experiments suggest.

[0] : http://wiki.apache.org/nutch/bin/nutch_readseg



On Sun, Dec 8, 2013 at 11:23 AM, Iain Lopata <il...@hotmail.com> wrote:

> I do not know whether this would be a factor, but I have noticed that
> the page has about 350 characters of LineFeeds, CarriageReturns and
> spaces before the <!DOCTYPE> declaration.  Could this be causing a
> problem for http-protocol in some way?  However, I can't explain why the
> same file with the same LF, CR and whitespace would read correctly from a
> different host.


Re: Unsuccessful fetch/parse of large page with many outlinks

Posted by "S.L" <si...@gmail.com>.
Interesting Iain, if it's a firewall issue, it seems there is nothing that
Nutch can do.



RE: Unsuccessful fetch/parse of large page with many outlinks

Posted by Iain Lopata <il...@hotmail.com>.
Solved.

So I started to prepare a stripped-down routine outside Nutch to file a bug
report, but in the process I have solved the problem.

The issue was with the User-Agent string that I had configured.  Apparently
the domain in question runs dotDefender, a software firewall that checks,
among other things, the User-Agent string against a database of acceptable
values.  Curl, Wget and PHP's libcurl validate against that database.  My
custom and valid User-Agent does not, even if I use "Nutch" as the agent
name.

Changing the User-Agent string back to the Nutch default, or mimicking
common browsers, solves the problem, i.e. it would seem the site is happy to
be crawled by Nutch instances.

So this leaves me with a question.  Are there recommendations for a properly
configured User-Agent string that identifies an instance of a Nutch crawler
and does not run afoul of a firewall like this?  Using the Nutch default or
copying from another library or browser would both work, but neither seems
right.  I don't see other options.
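
(For reference, Nutch builds its User-Agent string from the http.agent.*
properties in nutch-site.xml, so an identifying but descriptive agent would
be configured along these lines; all values below are placeholders only, not
a recommendation from this thread:)

<property>
  <name>http.agent.name</name>
  <value>MyCompanyBot</value>
</property>
<property>
  <name>http.agent.version</name>
  <value>1.0</value>
</property>
<property>
  <name>http.agent.description</name>
  <value>Research crawler based on Apache Nutch 1.6</value>
</property>
<property>
  <name>http.agent.url</name>
  <value>http://www.example.com/bot.html</value>
</property>
<property>
  <name>http.agent.email</name>
  <value>crawler@example.com</value>
</property>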



Re: Unsuccessful fetch/parse of large page with many outlinks

Posted by Tejas Patil <te...@gmail.com>.
I think that you narrowed it down and most probably it's some
bug/incompatibility in the HTTP library which Nutch uses to talk with the
server. Were both of the servers where you hosted the url running IIS 6.0?
If yes, then there is more :)

Thanks,
Tejas
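
(For context, which HTTP implementation Nutch talks to the server with is
selected by the plugin.includes property; a sketch assuming the stock Nutch
1.6 plugin names, with protocol-httpclient swapped in for the default
protocol-http as the alternative one might try:)

<property>
  <name>plugin.includes</name>
  <value>protocol-httpclient|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>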



RE: Unsuccessful fetch/parse of large page with many outlinks

Posted by Iain Lopata <il...@hotmail.com>.
Out of ideas at this point.

I can retrieve the page with Curl
I can retrieve the page with Wget
I can view the page in my browser
I can retrieve the page by opening a socket from a PHP script
I can retrieve the page with nutch if I move the page to another host

But

Any page I try to fetch from www.friedfrank.com with Nutch reads just 198
bytes and then the stream closes.

Debug code inserted in HttpResponse and Wireshark both show that this is the
case.

Could someone else please try to fetch a page from this host with your
config?

My suspicion is that it is related to this host being on IIS 6.0 with this
problem being a potential cause: http://support.microsoft.com/kb/919797 
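
(One way to narrow this down outside Nutch is to repeat the request with
curl while varying individual request headers such as the User-Agent; a
sketch, with the agent value as a placeholder:)

curl -A "SomeTestAgent/1.0" -D headers.txt -o page.html "http://www.friedfrank.com/"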




RE: Unsuccessful fetch/parse of large page with many outlinks

Posted by Iain Lopata <il...@hotmail.com>.
Parses 652 outlinks from the ebay url without any difficulty.

Didn't want to change the title and thereby break this thread, but at this
point, and as stated in my last post, I am reasonably confident that for
some reason the InputReader in HttpResponse.java sees the stream as closed
after reading only 198 bytes.  Why, I do not know.



Re: Unsuccessful fetch/parse of large page with many outlinks

Posted by "S.L" <si...@gmail.com>.
I faced a similar problem with this page
http://www.ebay.com/sch/allcategories/all-categories/?_rdc=1 when I was
running Nutch from within Eclipse.  I was able to crawl all the outlinks
successfully when I ran Nutch as a jar outside of Eclipse; at that point it
was considered to be an issue with running it in Eclipse.

Can you please try this URL with your setup?  It has at least 600+
outlinks.


>

RE: Unsuccessful fetch/parse of large page with many outlinks

Posted by Iain Lopata <il...@hotmail.com>.
Some further analysis - no solution.

The pages in question do not return a Content-Length header.

Since the http.content.limit is set to -1, http-protocol sets the maximum
read length to 2147483647.

At line 231 of HttpResponse.java the loop:

for (int i = in.read(bytes); i != -1 && length + i <= contentLength; i = in.read(bytes))

executes once and once only and returns a stream of just 198 bytes.  No
exceptions are thrown.

So, I think, the question becomes: why would this connection close before
the end of the stream?  It certainly seems to be server-specific, since I
can retrieve the file successfully from a different host domain.
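
(A stripped-down check outside Nutch would be a raw-socket fetch that simply
counts how many bytes the server sends before closing the connection; a
minimal sketch in Java, with the host, path and User-Agent value as
placeholders:)

import java.io.InputStream;
import java.io.OutputStream;
import java.net.Socket;
import java.nio.charset.StandardCharsets;

// Minimal HTTP/1.0 GET over a raw socket that counts how many bytes the
// server sends (headers plus body) before it closes the connection.
public class RawFetch {
  public static void main(String[] args) throws Exception {
    String host = "www.example.com";       // placeholder host
    String path = "/index.cfm?pageID=12";  // placeholder path
    String agent = "SomeTestAgent/1.0";    // placeholder User-Agent

    try (Socket socket = new Socket(host, 80)) {
      OutputStream out = socket.getOutputStream();
      String request = "GET " + path + " HTTP/1.0\r\n"
          + "Host: " + host + "\r\n"
          + "User-Agent: " + agent + "\r\n"
          + "Accept: */*\r\n"
          + "\r\n";
      out.write(request.getBytes(StandardCharsets.US_ASCII));
      out.flush();

      InputStream in = socket.getInputStream();
      byte[] buf = new byte[4096];
      long total = 0;
      for (int n = in.read(buf); n != -1; n = in.read(buf)) {
        total += n;  // count everything the server sends until it closes
      }
      System.out.println("bytes received: " + total);
    }
  }
}

(If a run like this also stops after roughly 198 bytes with one User-Agent
but not another, the server, or something in front of it, is closing the
connection based on the request headers rather than on anything Nutch does
with the stream.)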



Re: Unsuccessful fetch/parse of large page with many outlinks

Posted by Tejas Patil <te...@gmail.com>.
> debug code that I have inserted in a custom filter shows that the file
that was retrieved is only 198 bytes long.
I am assuming that this code did not hinder the crawler. A better way to
see the content would be to take a segment dump [0] and then analyse it.
Also, turn on DEBUG logging in log4j for the HTTP protocol classes and the
fetcher class.
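
For example, something along these lines in conf/log4j.properties should do it
(the package name assumes the stock protocol-http plugin; adjust it if you use
protocol-httpclient):

log4j.logger.org.apache.nutch.protocol.http=DEBUG,cmdstdout
log4j.logger.org.apache.nutch.fetcher.Fetcher=DEBUG,cmdstdout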

> attempted to crawl it from that site and it works fine, retrieving all
597KB and parsing it successfully.
You mean that you ran a Nutch crawl with the problematic url as a seed and
used the EXACT same config on both machines, and one machine gave complete
content while the other did not? Note that using the EXACT same config across
these two runs is important.

> the page has about 350 characters of line feeds, carriage returns and spaces
No way. The HTTP request gets a byte stream back as the response. Also, had it
been the case that LF or CR characters caused a problem, it would hit Nutch
regardless of which machine you run Nutch from... but that's not what your
experiments suggest.

[0] : http://wiki.apache.org/nutch/bin/nutch_readseg
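
Concretely (the segment path below is just a placeholder; point it at whichever
segment the problematic fetch went into):

bin/nutch readseg -dump crawl/segments/20131208120000 segdump
less segdump/dump

The Content:: section of the dump should show exactly what the protocol layer
stored for the url, including how many bytes actually made it in.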




RE: Unsuccessful fetch/parse of large page with many outlinks

Posted by Iain Lopata <il...@hotmail.com>.
I do not know whether this would be a factor, but I have noticed that the
page has about 350 characters of line feeds, carriage returns and spaces
before the <!DOCTYPE> declaration.  Could this be causing a problem for
protocol-http in some way?  However, I can't explain why the same file with
the same LF, CR and whitespace would read correctly from a different host.
