Posted to user@nutch.apache.org by Elwin <ma...@gmail.com> on 2006/04/19 09:25:45 UTC

How to deal with javascript urls?

For example:
<a href="javascript:customCss(6017162)" id="customCssMenu" >test</a>
Can Nutch actually get content from this kind of URL?

Re: How to deal with javascript urls?

Posted by Andrzej Bialecki <ab...@getopt.org>.

Elwin wrote:
>  for example: <a href="javascript:customCss(6017162)"
>  id="customCssMenu" >test</a> in fact, can nutch get content from such
>  kind of urls?
>
>

Not without some drastic changes... I have an early implementation of a 
fetcher that uses the httpunit library to actually interpret the 
javascript and mimic a browser's behavior. The problem is that it's very 
slow: the current fetcher implementation is stateless, whereas one that 
supports javascript needs to be stateful, and it needs to retrieve 
multiple resources in one go (e.g. CSS, frames, script files, the main 
body, etc.). Then, discovering all outlinks requires a simulated "click" 
on every active element, which in turn requires executing all scripts 
associated with all current windows. If the scripts are not idempotent, 
you need to simulate the "Back" button, or drop and reload everything, 
to restore the previous state ...
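
To illustrate why a stateless fetcher cannot simply follow such links: a
javascript: href names a script call to run inside the page, not a
resource that can be requested over HTTP. A minimal Python sketch (not
Nutch code; the LinkSorter class and sort_links function are
hypothetical names used only for illustration) that separates such
links from ordinary ones while parsing a page might look like this:

```python
from html.parser import HTMLParser


class LinkSorter(HTMLParser):
    """Hypothetical helper: splits <a href> values into
    fetchable links and javascript: pseudo-links."""

    def __init__(self):
        super().__init__()
        self.fetchable = []
        self.script_only = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if not href:
            return
        # A javascript: href has no resource behind it; it can only be
        # resolved by executing the script in a browser-like context.
        if href.strip().lower().startswith("javascript:"):
            self.script_only.append(href)
        else:
            self.fetchable.append(href)


def sort_links(html):
    """Return (fetchable, script_only) href lists for one HTML document."""
    parser = LinkSorter()
    parser.feed(html)
    parser.close()
    return parser.fetchable, parser.script_only
```

Only the second list would need the slow, stateful, script-executing
fetcher described above.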

So, it's not easy. Your best bet would be to use a separate fetcher to 
fetch these problematic sites, and use the standard fetcher to fetch 
everything else.
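
As a rough sketch of that split, assuming you maintain your own list of
hosts that depend on javascript navigation (the host name and the
split_fetch_list function below are made up for illustration and are
not part of Nutch), the URL list could be partitioned before fetching:

```python
from urllib.parse import urlsplit

# Hypothetical list of hosts known to rely on javascript navigation.
SCRIPT_HEAVY_HOSTS = {"js-heavy.example.com"}


def split_fetch_list(urls, script_heavy_hosts=SCRIPT_HEAVY_HOSTS):
    """Partition URLs into (standard, special) lists by host name."""
    standard, special = [], []
    for url in urls:
        host = (urlsplit(url).hostname or "").lower()
        if host in script_heavy_hosts:
            special.append(url)   # for the separate, script-aware fetcher
        else:
            standard.append(url)  # for the standard fetcher
    return standard, special
```

Each list can then be handed to its own fetcher, so the slow path is
only paid for the sites that need it.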

-- 
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com