Posted to user@nutch.apache.org by Barnabás Balázs <ba...@impresign.com> on 2017/08/17 10:00:32 UTC

Crawl issues and Custom IndexWriter never called on index command solution

Dear Nutch Community,
Thanks for the answer. As it turned out, my issue was with the segments: I passed the folder in which my segments reside as a parameter instead of a specific segment folder. Changing this solved my issue, so thank you.
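For anyone who hits the same thing, here is a minimal sketch of the corrected invocation, assuming Nutch 1.x's org.apache.nutch.indexer.IndexingJob; the class name, crawl paths and segment timestamp below are illustrative, not taken from my setup:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.util.ToolRunner;
import org.apache.nutch.indexer.IndexingJob;
import org.apache.nutch.util.NutchConfiguration;

public class IndexOneSegment {
  public static void main(String[] args) throws Exception {
    Configuration conf = NutchConfiguration.create();
    // The indexer expects concrete segment directories, not their parent folder.
    int res = ToolRunner.run(conf, new IndexingJob(), new String[] {
        "crawl/crawldb",                // illustrative crawldb path
        "-linkdb", "crawl/linkdb",      // illustrative linkdb path
        "crawl/segments/20170816152314" // one specific segment, NOT "crawl/segments"
    });
    System.exit(res);
  }
}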
I spent a few days playing around with my crawl setup and ran into a few issues. For the crawl I'm using the libselenium plugin and PhantomJS. In the usual setup the generate job runs with topN=10000 and the fetcher gets 100 threads.
1. I set the topN parameter of our generate job to a low number (100) to test how well adaptive crawling rotates the crawled URLs. What I experienced was that mostly the same URLs were fetched and only a few (8-10) changed between crawls. The following relevant configuration is set:
db.fetch.interval.default = "3600"
db.fetch.schedule.class = "org.apache.nutch.crawl.AdaptiveFetchSchedule"
db.fetch.schedule.adaptive.min_interval = "1200"
db.fetch.schedule.adaptive.inc_rate = "0.4"
db.fetch.schedule.adaptive.dec_rate = "0.2"
db.fetch.schedule.adaptive.sync_delta = "true"
db.fetch.schedule.adaptive.sync_delta_rate = "0.3"

The main goal is a large-scale crawl that re-fetches only the often-changing sites regularly.
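For reference, a hedged paraphrase of the interval arithmetic behind AdaptiveFetchSchedule using the values above (the real class also applies the sync_delta adjustment, which this sketch omits; class and method names here are just for illustration): unchanged pages back off by inc_rate per fetch, changed pages are rescheduled sooner by dec_rate, clamped between min_interval and the maximum interval.

public class AdaptiveIntervalSketch {
  // Values from the configuration above; MAX_INTERVAL uses the Nutch default of one year.
  static final float INC_RATE = 0.4f;                 // db.fetch.schedule.adaptive.inc_rate
  static final float DEC_RATE = 0.2f;                 // db.fetch.schedule.adaptive.dec_rate
  static final float MIN_INTERVAL = 1200f;            // db.fetch.schedule.adaptive.min_interval (s)
  static final float MAX_INTERVAL = 365f * 24 * 3600; // db.fetch.schedule.adaptive.max_interval (s)

  static float nextInterval(float interval, boolean pageChanged) {
    float next = pageChanged
        ? interval * (1.0f - DEC_RATE)   // changed -> re-fetch sooner
        : interval * (1.0f + INC_RATE);  // unchanged -> back off
    return Math.max(MIN_INTERVAL, Math.min(MAX_INTERVAL, next));
  }

  public static void main(String[] args) {
    float interval = 3600f;              // db.fetch.interval.default
    for (int round = 1; round <= 5; round++) {
      interval = nextInterval(interval, false);
      System.out.printf("after %d unchanged fetches: %.0f s%n", round, interval);
    }
  }
}

Note that with topN=100 the generator picks the 100 highest-scoring URLs that are due for fetching, so seeing mostly the same set come back is expected unless their intervals have grown past the crawl cadence.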
2. I'm seeing quite a lot of UnknownHostException errors, for example:
2017-08-16 15:23:28,099 INFO [FetcherThread] org.apache.nutch.protocol.http.api.HttpRobotRulesParser: Couldn't get robots.txt for http://www.tomgeorge.hu/phplapozo/tgkepek.php: java.net.UnknownHostException: www.tomgeorge.hu
2017-08-16 15:23:28,099 ERROR [FetcherThread] org.apache.nutch.protocol.selenium.Http: Failed to get protocol output

I managed to ping the failing sites from the machine where the jobs are running, so I'm not sure how this could be a DNS issue.
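One thing that may be worth ruling out (a hedged guess): the fetcher threads run inside Hadoop map tasks, so name resolution has to work on the worker nodes, not only on the machine that submits the jobs. A minimal check that could be run on each node, using the host from the log above (the class name is just for illustration):

import java.net.InetAddress;
import java.net.UnknownHostException;

public class DnsCheck {
  public static void main(String[] args) {
    // Host taken from the log above; pass a different one as the first argument.
    String host = args.length > 0 ? args[0] : "www.tomgeorge.hu";
    try {
      InetAddress addr = InetAddress.getByName(host);
      System.out.println(host + " resolves to " + addr.getHostAddress());
    } catch (UnknownHostException e) {
      System.out.println("Cannot resolve " + host + ": " + e.getMessage());
    }
  }
}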

3. Selenium throws NoSuchElementExceptions:
2017-08-16 15:28:14,963 ERROR [FetcherThread] org.apache.nutch.protocol.selenium.Http: Failed to get protocol output java.lang.RuntimeException: org.openqa.selenium.NoSuchElementException: {"errorMessage":"Unable to find element with tag name 'body'","request":{"headers":{"Accept-Encoding":"gzip,deflate","Connection":"Keep-Alive","Content-Length":"35","Content-Type":"application/json; charset=utf-8","Host":"localhost:32377","User-Agent":"Apache-HttpClient/4.5.2 (Java/1.8.0_144)"},"httpVersion":"1.1","method":"POST","post":"{\"using\":\"tag name\",\"value\":\"body\"}","url":"/element","urlParsed":{"anchor":"","query":"","file":"element","directory":"/","path":"/element","relative":"/element","port":"","host":"","password":"","user":"","userInfo":"","authority":"","protocol":"","source":"/element","queryKey":{},"chunks":["element"]},"urlOriginal":"/session/7eec7df0-8297-11e7-b678-71c58580ccf2/element"}} Command duration or timeout: 288 milliseconds

I'm seeing this one a lot and I'm not sure what the issue could be. I opened some of the links and there is definitely a <body> tag in their HTML.
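A hedged sketch of a possible workaround (this is not the lib-selenium plugin's actual code): use an explicit wait for <body> instead of an immediate findElement(), so a page that is still building its DOM is not abandoned after a few hundred milliseconds, as with the 288 ms above.

import org.openqa.selenium.By;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.WebElement;
import org.openqa.selenium.support.ui.ExpectedConditions;
import org.openqa.selenium.support.ui.WebDriverWait;

public class BodyWait {
  /** Returns the <body> element, waiting up to timeoutSeconds for it to appear. */
  public static WebElement waitForBody(WebDriver driver, long timeoutSeconds) {
    return new WebDriverWait(driver, timeoutSeconds)
        .until(ExpectedConditions.presenceOfElementLocated(By.tagName("body")));
  }
}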

4. Encoding issues pop up with some of the robots.txt files. I checked several of these files and they seem fine (no special characters or anything, as far as I can tell). This is what the exception looks like:
2017-08-16 15:24:37,765 WARN [FetcherThread] crawlercommons.robots.SimpleRobotRulesParser: Problem processing robots.txt for http://www.zeneszmagazin.hu/nyitolap.html
2017-08-16 15:24:37,765 WARN [FetcherThread] crawlercommons.robots.SimpleRobotRulesParser: Unknown line in robots.txt file (size 551): ????n?0 ?? ?z=`?b[iW`(
2017-08-16 15:24:37,765 WARN [FetcherThread] crawlercommons.robots.SimpleRobotRulesParser: Unknown line in robots.txt file (size 551): ?
2017-08-16 15:24:37,765 WARN [FetcherThread] crawlercommons.robots.SimpleRobotRulesParser: Unknown line in robots.txt file (size 551): ?] ??8??Ii???$???n??\?'?<???;r?J-??p????????{}?Y???9??????q?x???]Lk?'?<VsK????L^?_?v??????>3??9l9??-+????S"????????P???Q5???x5o??
2017-08-16 15:24:37,765 WARN [FetcherThread] crawlercommons.robots.SimpleRobotRulesParser: Unknown line in robots.txt file (size 551): '?2?:2>???(?z?
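The unreadable lines look more like a binary (possibly compressed) response body than a badly encoded text file. A hedged diagnostic sketch, assuming you first dump the fetched robots.txt bytes to a local file (the class name and input path are illustrative): check for the gzip magic number and try to decompress.

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.zip.GZIPInputStream;

public class RobotsBytesCheck {
  public static void main(String[] args) throws IOException {
    byte[] bytes = Files.readAllBytes(Paths.get(args[0])); // dump of the fetched robots.txt body
    boolean gzipMagic = bytes.length > 2 && (bytes[0] & 0xFF) == 0x1F && (bytes[1] & 0xFF) == 0x8B;
    System.out.println("gzip magic present: " + gzipMagic);
    if (gzipMagic) {
      try (GZIPInputStream in = new GZIPInputStream(new ByteArrayInputStream(bytes))) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        int n;
        while ((n = in.read(buf)) != -1) {
          out.write(buf, 0, n);
        }
        System.out.println(out.toString("UTF-8")); // the robots.txt as the server intended it
      }
    }
  }
}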

5. Crawling AngularJS/Angular 2 sites is not working properly. I tried multiple sites built with one of these frameworks, but it seems I always get the 'unloaded' version: the HTML I receive often consists only of a "Loading..." text or something similar.
I tried to set a page load delay, but I'm not sure if it's actually working (see the wait sketch after the property list below):
page.load.delay = "10"
libselenium.page.load.delay = "10"
selenium.page.load.delay = "10"
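In case it helps, a hedged sketch of the kind of wait that client-side rendered apps tend to need (the URL, selector and class name are illustrative, and this is not the plugin's code): first wait for document.readyState, then for an element that only exists once the app has rendered. It may also be worth checking whether PhantomJS can run the site's JavaScript at all, since its engine lags behind modern browsers and can leave Angular apps stuck at the loading shell.

import org.openqa.selenium.By;
import org.openqa.selenium.JavascriptExecutor;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.phantomjs.PhantomJSDriver;
import org.openqa.selenium.support.ui.ExpectedCondition;
import org.openqa.selenium.support.ui.ExpectedConditions;
import org.openqa.selenium.support.ui.WebDriverWait;

public class WaitForRenderedApp {
  public static void main(String[] args) {
    WebDriver driver = new PhantomJSDriver();
    try {
      driver.get("http://example.com/angular-app");   // illustrative URL
      WebDriverWait wait = new WebDriverWait(driver, 30);
      // 1. The static document has finished loading.
      wait.until(new ExpectedCondition<Boolean>() {
        @Override
        public Boolean apply(WebDriver d) {
          return "complete".equals(
              ((JavascriptExecutor) d).executeScript("return document.readyState"));
        }
      });
      // 2. Something that only the rendered app produces (pick a selector for the target site).
      wait.until(ExpectedConditions.presenceOfElementLocated(By.cssSelector("app-root > *")));
      System.out.println(driver.getPageSource().length() + " bytes of rendered HTML");
    } finally {
      driver.quit();
    }
  }
}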
Some of the above issues could be caused by faulty data received from some of these sites; auto-skipping those cases would be an acceptable solution if we can detect them reliably.

Anyway, I'd appreciate any help I could get with any of the above issues. :)

Thanks, 
Barnabas
-------- Forwarded Message --------
From: Sebastian Nagel <wa...@googlemail.com>
Date: 2017-08-10 09:34:06
Subject: Re: Custom IndexWriter never called on index command
To: user@nutch.apache.org

Hi Barnabas,

> The reduce function of IndexerMapReduce only receives CrawlDatums (line 198); the
> parseData/parseText is always null, thus the function returns at line 261.

Parse data and text are stored in segments, while the CrawlDatum may come from the CrawlDb.
Does the index job get the segment with the fetched and parsed pages passed as input?
If "parseData/parseText is always null" no segment is read (or the segment is empty).

Best,
Sebastian

On 08/09/2017 07:49 PM, Barnabás Balázs wrote:
> Small followup tidbit:
>
> The reduce function of IndexerMapReduce only receives CrawlDatums (line 198); the parseData/parseText is always null, thus the function returns at line 261.
>
> So the main question now:
> Why is the Indexer only receiving CrawlDatums when the Parse step that runs before the Indexer creates the ParseData perfectly?
> On 2017. 08. 09. 19:00:27, Barnabás Balázs wrote:
> Dear community!
>
> I'm relatively new to Nutch 1.x and got stumped on an indexing issue.
> I have a local Java application that sends Nutch jobs to a remote Hadoop deployment for execution. The jobs are sent in the following order:
> Inject -> Generate -> Fetch -> Parse -> Index -> Update -> Invertlinks
> Once a round is finished it starts over. The commands are of course configured based on the previous one's results (when necessary).
>
> This setup seems to work; I can see that fetch gathers the correct URLs, for example. The problem is the Index stage. I implemented a custom IndexWriter that should send data to Couchbase buckets and Kafka producers. However, even though the plugin seems to be constructed correctly (I can see Kafka producer setup records in the reduce log), the open/write/update functions are never called. I put logs in each and also used remote debugging to make sure that they are really never called.
> I also used a debugger inside the IndexerMapReduce class and, to be honest, I'm not sure where the IndexWriter is used, but the job definitely receives data (I saw the fetched URLs).
>
> I should mention that I also created an HTMLParseFilter plugin and that one works perfectly, so plugin deployment shouldn't be the issue. Also in the logs I can see the following:
> Registered Plugins: ... Couchbase indexer (indexer-couchbase) ... org.apache.nutch.indexer.IndexWriters: Adding correct.package.Indexer
> I've been stuck on this issue for a few days now; any help or ideas on why my IndexWriter is never called when running an Indexer job would be appreciated.
>
> Best,
> Barnabas
>


Re: Crawl issues and Custom IndexWriter never called on index command solution

Posted by Barnabás Balázs <ba...@impresign.com>.
Hi,

Sadly I haven't been able to make progress on these issues. Does anyone in the community know how any of these problems could be solved?