Posted to dev@nutch.apache.org by ThammeGowda N <tg...@gmail.com> on 2015/10/03 00:13:15 UTC
Redundant requests when interactive selenium is enabled
Hello, Nutch Experts,
Problem: Nutch makes duplicate/redundant requests for resources when the
'protocol-interactiveselenium' plugin is enabled.
Questions: Is this expected behaviour, or is my setup misconfigured?
To reduce the unnecessary network operations, do I need to disable any
preconfigured plugin when the interactive selenium plugin is enabled?
Looking at the request log of my test server (pasted below), I see one
request from Nutch (I guess it is from the fetcher) and another
(sometimes two) from the configured web driver, in my case Firefox. I am
counting only requests for HTML pages, not images, scripts, and stylesheets.
*Request Log from test server:*
Available on:
http:127.0.0.1:8080
http:192.168.0.14:8080
Hit CTRL-C to stop the server
[21:23:14 GMT] "GET /robots.txt" "Nutch/1.11 (X11; Linux x86_64)
AppleWebKit/537.36 (KHTML, like Gecko) USC CSCI-572 Fall-15 Group-36
Member-1"
[21:23:14 GMT] "GET /" "Nutch/1.11 (X11; Linux x86_64) AppleWebKit/537.36
(KHTML, like Gecko) USC CSCI-572 Fall-15 Group-36 Member-12"
[21:23:16 GMT] "GET /" "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:33.0)
Gecko/20100101 Firefox/33.0"
[21:23:16 GMT] "GET /favicon.ico" "Mozilla/5.0 (X11; Ubuntu; Linux x86_64;
rv:33.0) Gecko/20100101 Firefox/33.0"
[21:23:18 GMT] "GET /" "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:33.0)
Gecko/20100101 Firefox/33.0"
[21:23:18 GMT] "GET /favicon.ico" "Mozilla/5.0 (X11; Ubuntu; Linux x86_64;
rv:33.0) Gecko/20100101 Firefox/33.0"
[21:23:36 GMT] "GET /robots.txt" "Nutch/1.11 (X11; Linux x86_64)
AppleWebKit/537.36 (KHTML, like Gecko) USC CSCI-572 Fall-15 Group-36
Member-18"
[21:23:41 GMT] "GET /firstpage.html" "Nutch/1.11 (X11; Linux x86_64)
AppleWebKit/537.36 (KHTML, like Gecko) USC CSCI-572 Fall-15 Group-36
Member-13"
[21:23:44 GMT] "GET /firstpage.html" "Mozilla/5.0 (X11; Ubuntu; Linux
x86_64; rv:33.0) Gecko/20100101 Firefox/33.0"
[21:23:44 GMT] "GET /favicon.ico" "Mozilla/5.0 (X11; Ubuntu; Linux x86_64;
rv:33.0) Gecko/20100101 Firefox/33.0"
[21:23:46 GMT] "GET /firstpage.html" "Mozilla/5.0 (X11; Ubuntu; Linux
x86_64; rv:33.0) Gecko/20100101 Firefox/33.0"
[21:23:46 GMT] "GET /images/first-page.jpg" "Mozilla/5.0 (X11; Ubuntu;
Linux x86_64; rv:33.0) Gecko/20100101 Firefox/33.0"
[21:23:46 GMT] "GET /favicon.ico" "Mozilla/5.0 (X11; Ubuntu; Linux x86_64;
rv:33.0) Gecko/20100101 Firefox/33.0"
[21:23:51 GMT] "GET /js-page.html" "Nutch/1.11 (X11; Linux x86_64)
AppleWebKit/537.36 (KHTML, like Gecko) USC CSCI-572 Fall-15 Group-36
Member-13"
[21:23:53 GMT] "GET /js-page.html" "Mozilla/5.0 (X11; Ubuntu; Linux x86_64;
rv:33.0) Gecko/20100101 Firefox/33.0"
[21:23:53 GMT] "GET /favicon.ico" "Mozilla/5.0 (X11; Ubuntu; Linux x86_64;
rv:33.0) Gecko/20100101 Firefox/33.0"
[21:23:55 GMT] "GET /js-page.html" "Mozilla/5.0 (X11; Ubuntu; Linux x86_64;
rv:33.0) Gecko/20100101 Firefox/33.0"
[21:23:55 GMT] "GET /js/extjs.js" "Mozilla/5.0 (X11; Ubuntu; Linux x86_64;
rv:33.0) Gecko/20100101 Firefox/33.0"
[21:23:55 GMT] "GET /favicon.ico" "Mozilla/5.0 (X11; Ubuntu; Linux x86_64;
rv:33.0) Gecko/20100101 Firefox/33.0"
[21:24:19 GMT] "GET /robots.txt" "Nutch/1.11 (X11; Linux x86_64)
AppleWebKit/537.36 (KHTML, like Gecko) USC CSCI-572 Fall-15 Group-36
Member-13"
[21:24:19 GMT] "GET /secondpage.html" "Nutch/1.11 (X11; Linux x86_64)
AppleWebKit/537.36 (KHTML, like Gecko) USC CSCI-572 Fall-15 Group-36
Member-17"
[21:24:22 GMT] "GET /secondpage.html" "Mozilla/5.0 (X11; Ubuntu; Linux
x86_64; rv:33.0) Gecko/20100101 Firefox/33.0"
[21:24:22 GMT] "GET /favicon.ico" "Mozilla/5.0 (X11; Ubuntu; Linux x86_64;
rv:33.0) Gecko/20100101 Firefox/33.0"
[21:24:24 GMT] "GET /secondpage.html" "Mozilla/5.0 (X11; Ubuntu; Linux
x86_64; rv:33.0) Gecko/20100101 Firefox/33.0"
[21:24:24 GMT] "GET /favicon.ico" "Mozilla/5.0 (X11; Ubuntu; Linux x86_64;
rv:33.0) Gecko/20100101 Firefox/33.0"
[21:24:29 GMT] "GET /index.html" "Nutch/1.11 (X11; Linux x86_64)
AppleWebKit/537.36 (KHTML, like Gecko) USC CSCI-572 Fall-15 Group-36
Member-17"
[21:24:31 GMT] "GET /index.html" "Mozilla/5.0 (X11; Ubuntu; Linux x86_64;
rv:33.0) Gecko/20100101 Firefox/33.0"
[21:24:31 GMT] "GET /favicon.ico" "Mozilla/5.0 (X11; Ubuntu; Linux x86_64;
rv:33.0) Gecko/20100101 Firefox/33.0"
[21:24:33 GMT] "GET /index.html" "Mozilla/5.0 (X11; Ubuntu; Linux x86_64;
rv:33.0) Gecko/20100101 Firefox/33.0"
Enabled Plugins:
<property>
  <name>plugin.includes</name>
  <value>protocol-interactiveselenium|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>
Additional settings:
Rotating user agent enabled
Firefox version 33
Selenium version 2.44.0
Thanks and regards,
Thamme Gowda N