Posted to user@nutch.apache.org by Chris Gray <cp...@uwaterloo.ca> on 2018/05/23 17:38:34 UTC

Problems starting crawl from sitemaps

I've been using Nutch for a few years to do conventional link-to-link 
crawls of our local websites, but I would like to switch to doing crawls 
based on sitemaps.  So far I've had no luck doing this.

I'm not sure I've configured this correctly, and the documentation I've 
found has left me guessing at many things.  Why aren't the pages listed 
in a sitemap being fetched and indexed?

I've installed Nutch 1.14 and Solr 6.6.0.  My urls/seeds.txt file 
contains only the URLs for the 4 sitemaps I'm interested in.  After running:

bin/crawl -i -D "solr.server.url=http://localhost:8983/solr/nutch" -s urls crawl 5

the crawl ends after 3 of 5 iterations and only 3 documents are in the 
index:  3 of the seeds.
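
For reference, urls/seeds.txt contains nothing but the sitemap URLs, one 
per line; something like this (the first URL is one of the real ones, the 
others here are just placeholders for the other three sites):

https://uwaterloo.ca/library/sitemap.xml
https://site2.example.uwaterloo.ca/sitemap.xml
https://site3.example.uwaterloo.ca/sitemap.xml
https://site4.example.uwaterloo.ca/sitemap.xml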

I do get error messages saying that 3 of the sitemap files containing 
<urlset> elements are malformed, for example:

2018-05-23 08:57:24,564 ERROR tika.TikaParser - Error parsing https://uwaterloo.ca/library/sitemap.xml
Caused by: org.xml.sax.SAXParseException; lineNumber: 420; columnNumber: 122; XML document structures must start and end within the same entity.

But I can't find anything wrong with the sitemaps, other validators say 
they're OK, and the location pointed to (line 420, column 122) is in the 
middle of a directory name in a URL.

Is there good documentation or a tutorial on using Nutch with sitemaps?



Re: Problems starting crawl from sitemaps

Posted by Chris Gray <cp...@uwaterloo.ca>.
Many thanks, Yossi!

I was beginning to think the problems were along those lines (truncation 
of long XML files and the need to run bin/nutch sitemap first). Thanks for 
clarifying how to fix the issues.  I'm successfully running a crawl now.

Chris

On 2018-05-24 07:19 AM, Yossi Tamari wrote:
> Hi Chris,
>
> In order to inject sitemaps, you should use the "nutch sitemap" command. After you inject those sitemaps to the crawl DB, you can proceed as normal with the crawl command, without the -s parameter.
> The error you are seeing may be because you have http.content.limit defined. The default value would cause any document to be truncated after 65536 bytes. For sitemaps, you should set it to a much larger number, or -1.
>
> 	 Yossi.
>


RE: Problems starting crawl from sitemaps

Posted by Yossi Tamari <yo...@pipl.com>.
Hi Chris,

In order to inject sitemaps, you should use the "nutch sitemap" command. After you inject those sitemaps to the crawl DB, you can proceed as normal with the crawl command, without the -s parameter.
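
For example, assuming your crawl directory is "crawl" (as in your command) and the sitemap seed URLs stay in the "urls" directory, the sequence would look roughly like this (the option names are from memory, so please check the usage output of "bin/nutch sitemap"):

# inject the sitemap URLs from urls/ into the existing crawl DB
bin/nutch sitemap crawl/crawldb -sitemapUrls urls

# then continue crawling as usual, without -s, reusing the same crawl directory
bin/crawl -i -D "solr.server.url=http://localhost:8983/solr/nutch" crawl 5
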
The error you are seeing may be because you have http.content.limit defined. The default value would cause any document to be truncated after 65536 bytes. For sitemaps, you should set it to a much larger number, or -1.
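
For example, something along these lines in conf/nutch-site.xml (http.content.limit is the standard property name; -1 means no limit):

<property>
  <name>http.content.limit</name>
  <value>-1</value>
  <description>Do not truncate fetched content; sitemaps can easily exceed the 65536-byte default.</description>
</property>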

	 Yossi.
