Posted to user@nutch.apache.org by Jack Tang <hi...@gmail.com> on 2005/09/07 09:51:39 UTC

Re: JavaScript Urls

Hi Andrzej

I think javascript-function-and-url mapping is a good solution.
Say
domainName.javascript:go = http://www.a.com/b.jsp?id={0}

"go" is the JavaScript function and it takes one param, and
"http://www.a.com/b.jsp?id={0}" is the URL template for the "go" function.
{0} is the actual param; it should be substituted into the template
when a call to "go" is detected.
The problem I face now is that the "go" function submits a form, and
the form is submitted with "POST".
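
A minimal sketch of the mapping idea above, in Java. This is not existing Nutch code: the class name JsUrlTemplates, its methods, and the regex are assumptions made up for illustration. Only the "domainName.javascript:go" key format and the {0} template come from this mail. It only builds a GET-style URL and does not address the POST case.

```java
import java.text.MessageFormat;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: maps "domain.javascript:function" keys to URL templates.
class JsUrlTemplates {
    // Matches hrefs such as javascript:go('42') and captures the function
    // name and the raw argument list. Nested calls or ")" inside an
    // argument are not handled.
    private static final Pattern CALL =
        Pattern.compile("^javascript:\\s*(\\w+)\\s*\\(([^)]*)\\)\\s*;?\\s*$");

    private final Map<String, String> templates = new HashMap<>();

    void register(String domain, String function, String template) {
        templates.put(domain + ".javascript:" + function, template);
    }

    // Returns the filled-in URL, or null if the href is not a simple call
    // or no template is registered for the function on this domain.
    String resolve(String domain, String href) {
        Matcher m = CALL.matcher(href.trim());
        if (!m.matches()) {
            return null;
        }
        String template = templates.get(domain + ".javascript:" + m.group(1));
        if (template == null) {
            return null;
        }
        List<String> args = new ArrayList<>();
        String argList = m.group(2).trim();
        if (!argList.isEmpty()) {
            for (String raw : argList.split(",")) {
                // Strip one surrounding quote character from each side, if present.
                args.add(raw.trim().replaceAll("^['\"]|['\"]$", ""));
            }
        }
        // Arguments are passed as strings so MessageFormat inserts them as-is.
        return MessageFormat.format(template, args.toArray());
    }

    public static void main(String[] unused) {
        JsUrlTemplates t = new JsUrlTemplates();
        t.register("www.a.com", "go", "http://www.a.com/b.jsp?id={0}");
        // prints http://www.a.com/b.jsp?id=42
        System.out.println(t.resolve("www.a.com", "javascript:go('42')"));
    }
}
```

Arguments are inserted without URL encoding, and a template containing single quotes would need escaping for MessageFormat, so a real version would need more care.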

Regards
/Jack

On 6/10/05, Andrzej Bialecki <ab...@getopt.org> wrote:
> Howie Wang wrote:
> > I think you have to hack the parse-html plugin. Look at getOutlinks
> > in DOMContentUtils.java. You'll probably have to look for targets that
> > start with "javascript:" and do some string replacing.
> 
> The latest SVN version already has a JavaScript link extractor
> (JSParseFilter in parse-js plugin). Currently it handles extraction of
> JS snippets from HTML events (onload, onclick, onmouseover, etc), and of
> course from <script> elements. The only thing missing to handle your
> case is to add a clause to handle the "javascript:" in any other attribute.
> 
> I can make this change. Watch the commit messages so that you know when
> to sync your source.
> 
> --
> Best regards,
> Andrzej Bialecki     <><
>   ___. ___ ___ ___ _ _   __________________________________
> [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
> ___|||__||  \|  ||  |  Embedded Unix, System Integration
> http://www.sigram.com  Contact: info at sigram dot com
> 
> 


-- 
Keep Discovering ... ...
http://www.jroller.com/page/jmars

Re: JavaScript Urls

Posted by Jack Tang <hi...@gmail.com>.
:) 
Thanks for your tip

Regards
/Jack

On 9/7/05, Andrzej Bialecki <ab...@getopt.org> wrote:
> Jack Tang wrote:
> > Hi Andrzej
> >
> > I think javascript-function-and-url mapping is a good solution.
> > Say
> > domainName.javascript:go = http://www.a.com/b.jsp?id={0}
> >
> > "go" is the JavaScript function and it takes one param, and
> > "http://www.a.com/b.jsp?id={0}" is the URL template for the "go" function.
> > {0} is the actual param; it should be substituted into the template
> > when a call to "go" is detected.
> > The problem I face now is that the "go" function submits a form, and
> > the form is submitted with "POST".
> 
> Wow, that's a pretty old thread... The JS "pseudo-parser" plugin is just
> that - it doesn't really understand JavaScript, it just tries to extract
> URLs, and does so with quite a high error rate... but still better than
> nothing.
> 
> If you want a full-fledged solution that can actually interpret your
> scripts, then take a look at the HttpUnit or HtmlUnit frameworks - both of
> which can be turned into JavaScript-aware crawlers.
> 
> --
> Best regards,
> Andrzej Bialecki     <><
>   ___. ___ ___ ___ _ _   __________________________________
> [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
> ___|||__||  \|  ||  |  Embedded Unix, System Integration
> http://www.sigram.com  Contact: info at sigram dot com
> 
> 


-- 
Keep Discovering ... ...
http://www.jroller.com/page/jmars

Re: JavaScript Urls

Posted by Andrzej Bialecki <ab...@getopt.org>.
Jack Tang wrote:
> Hi Andrzej
> 
> I think javascript-function-and-url mapping is a good solution.
> Say
> domainName.javascript:go = http://www.a.com/b.jsp?id={0}
> 
> "go" is the JavaScript function and it takes one param, and
> "http://www.a.com/b.jsp?id={0}" is the URL template for the "go" function.
> {0} is the actual param; it should be substituted into the template
> when a call to "go" is detected.
> The problem I face now is that the "go" function submits a form, and
> the form is submitted with "POST".

Wow, that's a pretty old thread... The JS "pseudo-parser" plugin is just
that - it doesn't really understand JavaScript, it just tries to extract
URLs, and does so with quite a high error rate... but still better than
nothing.
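
To illustrate what extracting URLs without understanding the script can look like, here is a rough sketch. It is not the actual JSParseFilter code: the class name NaiveJsLinkExtractor and the regex are assumptions for illustration only. It scans a script snippet for quoted string literals that start with http:// or https://.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical example of a regex-only "pseudo-parser" for JS snippets.
class NaiveJsLinkExtractor {
    // A single- or double-quoted literal starting with http:// or https://.
    // The closing quote must match the opening one (backreference \1).
    private static final Pattern URL_LITERAL =
        Pattern.compile("([\"'])(https?://[^\"'\\s]+)\\1");

    static List<String> extract(String script) {
        List<String> urls = new ArrayList<>();
        Matcher m = URL_LITERAL.matcher(script);
        while (m.find()) {
            urls.add(m.group(2));
        }
        return urls;
    }

    public static void main(String[] unused) {
        // A literal URL is found as written.
        System.out.println(extract("window.location = 'http://www.a.com/b.jsp?id=1';"));
        // A URL built by concatenation yields only the literal prefix,
        // which is one source of the error rate.
        System.out.println(extract("var u = 'http://www.a.com/b.jsp?id=' + id;"));
    }
}
```

The second call returns the incomplete "http://www.a.com/b.jsp?id=", because the value of id is only known when the script actually runs.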

If you want a full-fledged solution that can actually interpret your
scripts, then take a look at the HttpUnit or HtmlUnit frameworks - both of
which can be turned into JavaScript-aware crawlers.

-- 
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com