Posted to dev@nutch.apache.org by ThammeGowda N <tg...@gmail.com> on 2015/10/03 00:13:15 UTC

Redundant requests when interactive selenium is enabled

Hello, Nutch Experts,


Problem: Nutch makes duplicate/redundant requests for resources when the
'protocol-interactiveselenium' plugin is enabled.


Questions: Is this expected behaviour or a misconfiguration?

To reduce the unnecessary network operations, do I need to disable any
preconfigured plugin when the interactive selenium plugin is enabled?

Looking at the request log of my test server (pasted below), I see one
request made by Nutch (I guess it is from the fetcher) and another
(sometimes two) from the configured web driver, in my case Firefox. I am
counting only requests to HTML pages, not images, scripts, or stylesheets.


*Request Log from test server:*

Available on:

 http:127.0.0.1:8080

 http:192.168.0.14:8080

Hit CTRL-C to stop the server

[21:23:14 GMT] "GET /robots.txt" "Nutch/1.11 (X11; Linux x86_64)
AppleWebKit/537.36 (KHTML, like Gecko) USC CSCI-572 Fall-15 Group-36
Member-1"

[21:23:14 GMT] "GET /" "Nutch/1.11 (X11; Linux x86_64) AppleWebKit/537.36
(KHTML, like Gecko) USC CSCI-572 Fall-15 Group-36 Member-12"

[21:23:16 GMT] "GET /" "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:33.0)
Gecko/20100101 Firefox/33.0"

[21:23:16 GMT] "GET /favicon.ico" "Mozilla/5.0 (X11; Ubuntu; Linux x86_64;
rv:33.0) Gecko/20100101 Firefox/33.0"

[21:23:18 GMT] "GET /" "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:33.0)
Gecko/20100101 Firefox/33.0"

[21:23:18 GMT] "GET /favicon.ico" "Mozilla/5.0 (X11; Ubuntu; Linux x86_64;
rv:33.0) Gecko/20100101 Firefox/33.0"

[21:23:36 GMT] "GET /robots.txt" "Nutch/1.11 (X11; Linux x86_64)
AppleWebKit/537.36 (KHTML, like Gecko) USC CSCI-572 Fall-15 Group-36
Member-18"

[21:23:41 GMT] "GET /firstpage.html" "Nutch/1.11 (X11; Linux x86_64)
AppleWebKit/537.36 (KHTML, like Gecko) USC CSCI-572 Fall-15 Group-36
Member-13"

[21:23:44 GMT] "GET /firstpage.html" "Mozilla/5.0 (X11; Ubuntu; Linux
x86_64; rv:33.0) Gecko/20100101 Firefox/33.0"

[21:23:44 GMT] "GET /favicon.ico" "Mozilla/5.0 (X11; Ubuntu; Linux x86_64;
rv:33.0) Gecko/20100101 Firefox/33.0"

[21:23:46 GMT] "GET /firstpage.html" "Mozilla/5.0 (X11; Ubuntu; Linux
x86_64; rv:33.0) Gecko/20100101 Firefox/33.0"

[21:23:46 GMT] "GET /images/first-page.jpg" "Mozilla/5.0 (X11; Ubuntu;
Linux x86_64; rv:33.0) Gecko/20100101 Firefox/33.0"

[21:23:46 GMT] "GET /favicon.ico" "Mozilla/5.0 (X11; Ubuntu; Linux x86_64;
rv:33.0) Gecko/20100101 Firefox/33.0"

[21:23:51 GMT] "GET /js-page.html" "Nutch/1.11 (X11; Linux x86_64)
AppleWebKit/537.36 (KHTML, like Gecko) USC CSCI-572 Fall-15 Group-36
Member-13"

[21:23:53 GMT] "GET /js-page.html" "Mozilla/5.0 (X11; Ubuntu; Linux x86_64;
rv:33.0) Gecko/20100101 Firefox/33.0"

[21:23:53 GMT] "GET /favicon.ico" "Mozilla/5.0 (X11; Ubuntu; Linux x86_64;
rv:33.0) Gecko/20100101 Firefox/33.0"

[21:23:55 GMT] "GET /js-page.html" "Mozilla/5.0 (X11; Ubuntu; Linux x86_64;
rv:33.0) Gecko/20100101 Firefox/33.0"

[21:23:55 GMT] "GET /js/extjs.js" "Mozilla/5.0 (X11; Ubuntu; Linux x86_64;
rv:33.0) Gecko/20100101 Firefox/33.0"

[21:23:55 GMT] "GET /favicon.ico" "Mozilla/5.0 (X11; Ubuntu; Linux x86_64;
rv:33.0) Gecko/20100101 Firefox/33.0"

[21:24:19 GMT] "GET /robots.txt" "Nutch/1.11 (X11; Linux x86_64)
AppleWebKit/537.36 (KHTML, like Gecko) USC CSCI-572 Fall-15 Group-36
Member-13"

[21:24:19 GMT] "GET /secondpage.html" "Nutch/1.11 (X11; Linux x86_64)
AppleWebKit/537.36 (KHTML, like Gecko) USC CSCI-572 Fall-15 Group-36
Member-17"

[21:24:22 GMT] "GET /secondpage.html" "Mozilla/5.0 (X11; Ubuntu; Linux
x86_64; rv:33.0) Gecko/20100101 Firefox/33.0"

[21:24:22 GMT] "GET /favicon.ico" "Mozilla/5.0 (X11; Ubuntu; Linux x86_64;
rv:33.0) Gecko/20100101 Firefox/33.0"

[21:24:24 GMT] "GET /secondpage.html" "Mozilla/5.0 (X11; Ubuntu; Linux
x86_64; rv:33.0) Gecko/20100101 Firefox/33.0"

[21:24:24 GMT] "GET /favicon.ico" "Mozilla/5.0 (X11; Ubuntu; Linux x86_64;
rv:33.0) Gecko/20100101 Firefox/33.0"

[21:24:29 GMT] "GET /index.html" "Nutch/1.11 (X11; Linux x86_64)
AppleWebKit/537.36 (KHTML, like Gecko) USC CSCI-572 Fall-15 Group-36
Member-17"

[21:24:31 GMT] "GET /index.html" "Mozilla/5.0 (X11; Ubuntu; Linux x86_64;
rv:33.0) Gecko/20100101 Firefox/33.0"

[21:24:31 GMT] "GET /favicon.ico" "Mozilla/5.0 (X11; Ubuntu; Linux x86_64;
rv:33.0) Gecko/20100101 Firefox/33.0"

[21:24:33 GMT] "GET /index.html" "Mozilla/5.0 (X11; Ubuntu; Linux x86_64;
rv:33.0) Gecko/20100101 Firefox/33.0"
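
For anyone who wants to repeat the count described above, here is a minimal
sketch in Python. It is not part of Nutch; the function name, the list of
asset suffixes, and the "nutch"/"browser" labels are my own assumptions for
illustration. It joins the wrapped log lines, skips robots.txt and static
assets, and tallies page requests per path and client type. The sample text
is copied from the first few entries of the log above.

```python
import re
from collections import Counter

# One log entry after line wraps are collapsed:
#   [HH:MM:SS GMT] "GET /path" "User-Agent string"
ENTRY = re.compile(r'\[(\d{2}:\d{2}:\d{2}) GMT\] "GET (\S+)" "([^"]*)"')

# Assumption: paths with these suffixes are not HTML pages.
ASSET_SUFFIXES = (".ico", ".jpg", ".png", ".js", ".css", ".txt")

SAMPLE = '''
[21:23:14 GMT] "GET /robots.txt" "Nutch/1.11 (X11; Linux x86_64)
AppleWebKit/537.36 (KHTML, like Gecko) USC CSCI-572 Fall-15 Group-36
Member-1"
[21:23:14 GMT] "GET /" "Nutch/1.11 (X11; Linux x86_64) AppleWebKit/537.36
(KHTML, like Gecko) USC CSCI-572 Fall-15 Group-36 Member-12"
[21:23:16 GMT] "GET /" "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:33.0)
Gecko/20100101 Firefox/33.0"
[21:23:16 GMT] "GET /favicon.ico" "Mozilla/5.0 (X11; Ubuntu; Linux x86_64;
rv:33.0) Gecko/20100101 Firefox/33.0"
[21:23:18 GMT] "GET /" "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:33.0)
Gecko/20100101 Firefox/33.0"
'''


def count_page_requests(log_text):
    """Tally page requests per (path, client), ignoring assets and robots.txt."""
    # The mail client wrapped long entries, so collapse all whitespace first.
    flat = " ".join(log_text.split())
    counts = Counter()
    for _time, path, agent in ENTRY.findall(flat):
        if path.lower().endswith(ASSET_SUFFIXES):
            continue
        # Assumption: Nutch's own fetch is recognisable by its agent prefix.
        client = "nutch" if agent.startswith("Nutch/") else "browser"
        counts[(path, client)] += 1
    return counts


for (path, client), n in sorted(count_page_requests(SAMPLE).items()):
    print(path, client, n)
```

On the sample this reports one request for "/" from Nutch and two from the
browser, which is the pattern I am asking about.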


Enabled Plugins:

<property>

  <name>plugin.includes</name>

  <value>protocol-interactiveselenium|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|scoring-opic|urlnormalizer-(pass|regex|basic)</value>

</property>


Additional Settings:

Rotating user agent enabled

Firefox version 33

Selenium version 2.44.0



Thanks and regards,

Thamme Gowda N