Posted to dev@nutch.apache.org by Jorge Luis Betancourt González <jl...@uci.cu> on 2015/07/06 19:32:18 UTC

Re: [MASSMAIL]RE: Nutch and JS/Css rendering

For maximum compatibility with browser rendering, something like PhantomJS or SlimerJS could be used. For one particular system we built some time ago, we used a separate service written in Node.js that renders a screenshot of a page as it would look in a browser. A parser plugin simply calls this service, which returns a base64-encoded screenshot of the web page. Additional parameters (width, height, etc.) can be set in the request. You could even specify which portion of the page you wanted, or supply additional JS to be executed before the screenshot was taken; we used this in a POC to highlight certain portions of the page.
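
To make the idea concrete, here is a rough sketch of what the parser-plugin side of such a call might look like. It is not our actual code: the class name, the /render endpoint, and the url/width/height parameter names are all made up for illustration, and it assumes the service answers with the base64 string as the plain response body.

```java
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;
import java.util.Base64;

// Hypothetical client for an external rendering service. The endpoint,
// parameter names, and response format are assumptions for illustration.
class ScreenshotClient {
    private final String serviceBase;

    ScreenshotClient(String serviceBase) {
        this.serviceBase = serviceBase;
    }

    // Build the request URI, passing the page URL and viewport size as query parameters.
    URI buildRequestUri(String pageUrl, int width, int height) {
        String encoded = URLEncoder.encode(pageUrl, StandardCharsets.UTF_8);
        return URI.create(serviceBase + "/render?url=" + encoded
                + "&width=" + width + "&height=" + height);
    }

    // Decode the base64 body returned by the service into raw image bytes.
    static byte[] decodeScreenshot(String base64Body) {
        return Base64.getDecoder().decode(base64Body.trim());
    }

    // Call the service and return the screenshot bytes, failing on a non-200 status.
    byte[] fetchScreenshot(String pageUrl, int width, int height) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder(buildRequestUri(pageUrl, width, height))
                .GET()
                .build();
        HttpResponse<String> response =
                client.send(request, HttpResponse.BodyHandlers.ofString());
        if (response.statusCode() != 200) {
            throw new IllegalStateException("Render service returned " + response.statusCode());
        }
        return decodeScreenshot(response.body());
    }
}
```

A parser plugin would call fetchScreenshot(...) for each page and store the bytes alongside the parse result; the network call is the extra moving part to watch for timeouts and failures.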

When we implemented this, the Selenium plugin didn't exist yet. Selenium can drive PhantomJS, but to be honest I haven't had the chance to fully test the Selenium plugin. The downside is that this adds another moving part to your architecture, unlike the HtmlUnit approach.

Regards,

----- Original Message -----
From: "Markus Jelsma" <ma...@openindex.io>
To: dev@nutch.apache.org
Sent: Monday, July 6, 2015 11:39:22 AM
Subject: [MASSMAIL]RE: Nutch and JS/Css rendering

Hello Talat! You can embed it in a parse filter plugin or, even better, a parser plugin. There is a method to detect client-side redirects done via JS as well as via meta tags. HtmlUnit will fire handlers for most events such as onLoad, just like a browser would.
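
For the meta-tag case only, a rough illustration of what such detection boils down to is below. This is not HtmlUnit's own code or API: it is a plain-JDK sketch with a made-up class name, it assumes http-equiv appears before content in the tag, and a regex like this is far less robust than a real HTML parser.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative only: finds the target of a <meta http-equiv="refresh"> tag.
// Assumes http-equiv comes before content and the content value is quoted.
class MetaRefreshDetector {
    // Matches a meta refresh tag and captures the value of its content attribute.
    private static final Pattern META_REFRESH = Pattern.compile(
            "<meta[^>]*http-equiv\\s*=\\s*[\"']?refresh[\"']?[^>]*content\\s*=\\s*[\"']([^\"']*)[\"']",
            Pattern.CASE_INSENSITIVE);

    // Matches the "url=..." part of a content value such as "0; url=http://example.com/".
    private static final Pattern URL_PART = Pattern.compile(
            "url\\s*=\\s*(\\S+)", Pattern.CASE_INSENSITIVE);

    // Returns the redirect target if the HTML contains a meta refresh with a URL.
    static Optional<String> findRedirectTarget(String html) {
        Matcher meta = META_REFRESH.matcher(html);
        if (!meta.find()) {
            return Optional.empty();
        }
        Matcher url = URL_PART.matcher(meta.group(1));
        return url.find() ? Optional.of(url.group(1)) : Optional.empty();
    }
}
```

JS-driven redirects cannot be found this way, since they only show up once the script has actually run, which is where HtmlUnit comes in.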

M.
 
-----Original message-----
> From:Talat Uyarer <ta...@uyarer.com>
> Sent: Monday 6th July 2015 15:38
> To: dev@nutch.apache.org
> Subject: Re: Nutch and JS/Css rendering
> 
> Hi Markus,
> 
> Thanks for sharing your experience. We can use HtmlUnit for our
> feature. Actually, I do not understand its architecture. How should we
> position HtmlUnit within our Nutch architecture? How does it handle
> Ajax-based pages? For example, how does it know which JS function
> should run? Secondly, how should JS-based redirect pages be handled? Do
> you have any idea?
> 
> 2015-07-06 13:02 GMT+03:00 Markus Jelsma <ma...@openindex.io>:
> > Hello Talat - we have used HtmlUnit to execute JS inside our parsers. It works very well but, whatever I tried, I have not been able to make events fire on scroll-down. Since HtmlUnit is a library, it does not require a separate daemon as Selenium does, which is an advantage in distributed fault-tolerant jobs.
> >
> > M.
> >
> > -----Original message-----
> >> From:Talat Uyarer <ta...@uyarer.com>
> >> Sent: Monday 6th July 2015 11:34
> >> To: dev@nutch.apache.org
> >> Subject: Nutch and JS/Css rendering
> >>
> >> Hi all,
> >>
> >> I saw here [1] that "Google decided to try to understand pages by
> >> executing JavaScript." What do you think, can we add JS rendering
> >> support to Nutch? If you have an idea, please share it with me; I
> >> will be glad.
> >>
> >> [1] http://googlewebmastercentral.blogspot.com.tr/2014/05/understanding-web-pages-better.html
> >>
> >> --
> >> Talat UYARER
> >>
> 
> 
> 
> -- 
> Talat UYARER
> Website: http://talat.uyarer.com
> Twitter: http://twitter.com/talatuyarer
> Linkedin: http://tr.linkedin.com/pub/talat-uyarer/10/142/304
>