Posted to user@nutch.apache.org by Elwin <ma...@gmail.com> on 2006/04/19 09:25:45 UTC
How to deal with javascript urls?
For example:
<a href="javascript:customCss(6017162)" id="customCssMenu" >test</a>
Can Nutch actually fetch content from URLs like this?
Re: How to deal with javascript urls?
Posted by Andrzej Bialecki <ab...@getopt.org>.
Elwin wrote:
> for example: <a href="javascript:customCss(6017162)"
> id="customCssMenu" >test</a> in fact, can nutch get content from such
> kind of urls?
>
>
Not without some drastic changes... I have an early implementation of a
fetcher that uses the HttpUnit library to actually interpret the JavaScript
and mimic a browser's behavior. The problem is that it's very slow: the
current fetcher implementation is stateless, whereas one that supports
JavaScript needs to be stateful, and it needs to retrieve multiple
resources in one go (e.g. CSS, frames, script files, the main body,
etc.). Then, discovering all outlinks requires a simulated "click" on all
active elements, which in turn requires executing all scripts associated
with all current windows. If the scripts are not idempotent, you need to
simulate the "Back" button, or drop/reload everything to restore the
previous state ...
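
To make the "click everything, then restore state" loop concrete, here is a
toy sketch in Python. It is not the HttpUnit-based fetcher and not Nutch code:
the page is modelled as a plain dict, and the handler functions, the
"navigations" key and the element ids are all hypothetical stand-ins for real
script execution. A deep copy per click plays the role of "drop/reload
everything".

```python
import copy


def discover_outlinks(page_state, handlers):
    """Simulate a click on every active element and collect the URLs reached.

    Each handler gets its own deep copy of the page state, so a
    non-idempotent script cannot affect the next simulated click.
    """
    outlinks = set()
    for element_id, handler in handlers.items():
        state = copy.deepcopy(page_state)  # stands in for a full reload
        handler(state)                     # stands in for running the script
        outlinks.update(state["navigations"])
    return outlinks


# Hypothetical handlers: the first one mutates state, so it is not idempotent.
def custom_css(state):
    state["counter"] += 1
    state["navigations"].append(f"/css/{state['counter']}")


def open_help(state):
    state["navigations"].append("/help")


initial = {"counter": 0, "navigations": []}
links = discover_outlinks(
    initial, {"customCssMenu": custom_css, "helpLink": open_help}
)
```

Even in this toy form the cost is visible: one full state copy per active
element, which in a real fetcher means re-fetching and re-executing the page.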
So, it's not easy. Your best bet would be to use a separate fetcher to
fetch these problematic sites, and use the standard fetcher to fetch
everything else.
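
One way to decide which pages go to which fetcher is to check the fetched
HTML for javascript: links. A minimal sketch in Python, using only the
standard library; the fetcher names "script-aware" and "standard" and the
function choose_fetcher are made up for illustration and are not part of
Nutch:

```python
from html.parser import HTMLParser


class JavascriptHrefDetector(HTMLParser):
    """Collect href values of <a> tags that use the javascript: scheme."""

    def __init__(self):
        super().__init__()
        self.javascript_hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            # value is None for attributes written without a value
            if name == "href" and value:
                if value.strip().lower().startswith("javascript:"):
                    self.javascript_hrefs.append(value)


def choose_fetcher(html):
    """Return a hypothetical fetcher label for a page's HTML."""
    detector = JavascriptHrefDetector()
    detector.feed(html)
    detector.close()
    return "script-aware" if detector.javascript_hrefs else "standard"
```

With the link from the original question, choose_fetcher would pick the
script-aware fetcher, while a page with only ordinary hrefs stays with the
standard one.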
--
Best regards,
Andrzej Bialecki <><
___. ___ ___ ___ _ _ __________________________________
[__ || __|__/|__||\/| Information Retrieval, Semantic Web
___|||__|| \| || | Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com