Posted to user@nutch.apache.org by Jack Tang <hi...@gmail.com> on 2005/09/07 09:51:39 UTC
Re: JavaScript Urls
Hi Andrzej
I think a mapping from JavaScript functions to URL templates is a good solution.
Say
domainName.javascript:go = http://www.a.com/b.jsp?id={0}
"go" is the JavaScript function, and it takes one parameter.
"http://www.a.com/b.jsp?id={0}" is the URL template for the "go" function,
and {0} stands for the actual argument, which should be substituted in
whenever a call to "go" is detected.
The problem I face now is that the "go" function submits a form, and
the form is submitted via "POST".
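[Editor's sketch] The function-to-URL-template mapping described above could look roughly like the following. This is only an illustration under stated assumptions, not existing Nutch code: the class JsUrlTemplates and its methods register and expand are hypothetical names, the call-matching regex is deliberately naive (no nested calls, no commas inside string arguments), and the "domainName." scoping of the key is left out.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: maps JavaScript function names to URL templates.
final class JsUrlTemplates {
    // Naive match for calls such as go(42) or go('42', "x"); no nesting.
    private static final Pattern CALL =
        Pattern.compile("([A-Za-z_$][\\w$]*)\\s*\\(([^()]*)\\)");

    private final Map<String, String> templates = new HashMap<>();

    // e.g. register("go", "http://www.a.com/b.jsp?id={0}")
    void register(String functionName, String urlTemplate) {
        templates.put(functionName, urlTemplate);
    }

    // Returns one URL for every call to a registered function found in js.
    List<String> expand(String js) {
        List<String> urls = new ArrayList<>();
        Matcher m = CALL.matcher(js);
        while (m.find()) {
            String template = templates.get(m.group(1));
            if (template == null) {
                continue; // unknown function, nothing to expand
            }
            String url = template;
            String[] args = splitArgs(m.group(2));
            for (int i = 0; i < args.length; i++) {
                // Replace {0}, {1}, ... with the URL-encoded argument.
                url = url.replace("{" + i + "}", encode(args[i]));
            }
            urls.add(url);
        }
        return urls;
    }

    // Splits on commas and strips one pair of surrounding quotes per argument.
    private static String[] splitArgs(String raw) {
        if (raw.trim().isEmpty()) {
            return new String[0];
        }
        String[] parts = raw.split(",");
        for (int i = 0; i < parts.length; i++) {
            String p = parts[i].trim();
            if (p.length() >= 2
                    && (p.charAt(0) == '\'' || p.charAt(0) == '"')
                    && p.charAt(p.length() - 1) == p.charAt(0)) {
                p = p.substring(1, p.length() - 1);
            }
            parts[i] = p;
        }
        return parts;
    }

    private static String encode(String value) {
        try {
            return URLEncoder.encode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // UTF-8 is always available
        }
    }

    public static void main(String[] args) {
        JsUrlTemplates mapping = new JsUrlTemplates();
        mapping.register("go", "http://www.a.com/b.jsp?id={0}");
        // prints [http://www.a.com/b.jsp?id=42]
        System.out.println(mapping.expand("javascript:go(42)"));
    }
}
```

A template like this only yields a URL that can be fetched with GET, so by itself it does not address the form that "go" submits via POST.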
Regards
/Jack
On 6/10/05, Andrzej Bialecki <ab...@getopt.org> wrote:
> Howie Wang wrote:
> > I think you have to hack the parse-html plugin. Look in
> > DOMContentUtils.java, in the getOutlinks method. You'll probably have
> > to look for targets that start with "javascript:" and do some string
> > replacing.
>
> The latest SVN version already has a JavaScript link extractor
> (JSParseFilter in parse-js plugin). Currently it handles extraction of
> JS snippets from HTML events (onload, onclick, onmouseover, etc), and of
> course from <script> elements. The only thing missing to handle your
> case is to add a clause to handle the "javascript:" in any other attribute.
>
> I can make this change. Watch the commit messages so that you know when
> to sync your source.
>
> --
> Best regards,
> Andrzej Bialecki <><
> ___. ___ ___ ___ _ _ __________________________________
> [__ || __|__/|__||\/| Information Retrieval, Semantic Web
> ___|||__|| \| || | Embedded Unix, System Integration
> http://www.sigram.com Contact: info at sigram dot com
>
>
--
Keep Discovering ... ...
http://www.jroller.com/page/jmars
Re: JavaScript Urls
Posted by Jack Tang <hi...@gmail.com>.
:)
Thanks for your tip
Regards
/Jack
On 9/7/05, Andrzej Bialecki <ab...@getopt.org> wrote:
> Jack Tang wrote:
> > Hi Andrzej
> >
> > I think a mapping from JavaScript functions to URL templates is a good solution.
> > Say
> > domainName.javascript:go = http://www.a.com/b.jsp?id={0}
> >
> > "go" is the JavaScript function, and it takes one parameter.
> > "http://www.a.com/b.jsp?id={0}" is the URL template for the "go" function,
> > and {0} stands for the actual argument, which should be substituted in
> > whenever a call to "go" is detected.
> > The problem I face now is that the "go" function submits a form, and
> > the form is submitted via "POST".
>
> Wow, that's a pretty old thread... The JS "pseudo-parser" plugin is just
> that: it doesn't really understand JavaScript, it just tries to extract
> URLs, and does so with quite a high error rate... but it's still better
> than nothing.
>
> If you want a full-fledged solution that can actually interpret your
> scripts, then take a look at the HttpUnit or HtmlUnit frameworks, both of
> which can be turned into JavaScript-aware crawlers.
--
Keep Discovering ... ...
http://www.jroller.com/page/jmars
Re: JavaScript Urls
Posted by Andrzej Bialecki <ab...@getopt.org>.
Jack Tang wrote:
> Hi Andrzej
>
> I think a mapping from JavaScript functions to URL templates is a good solution.
> Say
> domainName.javascript:go = http://www.a.com/b.jsp?id={0}
>
> "go" is the JavaScript function, and it takes one parameter.
> "http://www.a.com/b.jsp?id={0}" is the URL template for the "go" function,
> and {0} stands for the actual argument, which should be substituted in
> whenever a call to "go" is detected.
> The problem I face now is that the "go" function submits a form, and
> the form is submitted via "POST".
Wow, that's a pretty old thread... The JS "pseudo-parser" plugin is just
that: it doesn't really understand JavaScript, it just tries to extract
URLs, and does so with quite a high error rate... but it's still better
than nothing.
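[Editor's sketch] To make the "pseudo-parser" idea concrete, here is a rough illustration of that kind of heuristic extraction: find attribute values that start with "javascript:", then pull out string literals that look like absolute URLs. It is not the actual JSParseFilter code; the class JsLinkSketch, its method names, and both regular expressions are assumptions made up for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical heuristic extractor; it does not interpret JavaScript.
final class JsLinkSketch {
    // Any attribute whose quoted value starts with "javascript:",
    // e.g. href="javascript:go(1)". Group 2 is the script snippet.
    private static final Pattern JS_ATTR = Pattern.compile(
        "\\b[\\w-]+\\s*=\\s*([\"'])\\s*javascript:(.*?)\\1",
        Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Quoted string literals that look like absolute http(s) URLs.
    private static final Pattern URL_LITERAL = Pattern.compile(
        "([\"'])(https?://[^\"'\\s]+)\\1", Pattern.CASE_INSENSITIVE);

    // Returns the script snippets found in "javascript:" attribute values.
    static List<String> javascriptSnippets(String html) {
        List<String> snippets = new ArrayList<>();
        Matcher m = JS_ATTR.matcher(html);
        while (m.find()) {
            snippets.add(m.group(2));
        }
        return snippets;
    }

    // Returns URL-looking string literals inside a script snippet.
    static List<String> urlLiterals(String js) {
        List<String> urls = new ArrayList<>();
        Matcher m = URL_LITERAL.matcher(js);
        while (m.find()) {
            urls.add(m.group(2));
        }
        return urls;
    }

    // Combines both steps for a fragment of HTML.
    static List<String> extract(String html) {
        List<String> urls = new ArrayList<>();
        for (String snippet : javascriptSnippets(html)) {
            urls.addAll(urlLiterals(snippet));
        }
        return urls;
    }

    public static void main(String[] args) {
        String html =
            "<a href=\"javascript:window.open('http://www.a.com/b.jsp?id=1')\">x</a>";
        // prints [http://www.a.com/b.jsp?id=1]
        System.out.println(extract(html));
    }
}
```

A link such as href="javascript:go(42)" produces nothing here, because the target URL is built inside the function rather than written out as a literal. That is one kind of miss a purely textual approach makes.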
If you want a full-fledged solution that can actually interpret your
scripts, then take a look at the HttpUnit or HtmlUnit frameworks, both of
which can be turned into JavaScript-aware crawlers.
--
Best regards,
Andrzej Bialecki <><
___. ___ ___ ___ _ _ __________________________________
[__ || __|__/|__||\/| Information Retrieval, Semantic Web
___|||__|| \| || | Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com