
Posted to user@nutch.apache.org by Yu Gan <ga...@gmail.com> on 2006/12/24 09:14:14 UTC

About javascript URLs

Hi all,

I am a newcomer to Nutch. In my case, I want to crawl a specific website
that has lots of javascript URLs, such as <a href=javascript(1);>,
so I wondered whether Nutch can handle javascript URLs. After searching the
mailing list, I found that Nutch doesn't support them,
so I decided to solve it in a simple way: replace
"<a href=javascript(1);>" with "<a href='www.site.com/servlet?parameter=1'>" so that
Nutch can recognize the link.
Is this approach correct? I think the code needs to be added before Nutch analyses
the content, but how do I patch Nutch to do that? Does anyone here know the
details?
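
To make the idea concrete, here is a minimal sketch in plain Java of the replacement I have in mind. It does not use any Nutch API, and where it would be hooked into Nutch is exactly the open question. The class name JavascriptLinkRewriter and the regular expression are only assumptions for illustration; the pattern assumes every link looks like the example above, with a single numeric argument.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Nutch: rewrites javascript-style anchors
// into ordinary links before the page content is parsed.
final class JavascriptLinkRewriter {

    // Assumed link shape: <a href=javascript(1);> with an optional quote
    // around the attribute value. Group 1 captures the numeric argument.
    private static final Pattern JS_LINK = Pattern.compile(
            "<a\\s+href=[\"']?javascript\\((\\d+)\\);?[\"']?\\s*>",
            Pattern.CASE_INSENSITIVE);

    // Replaces every matching anchor with a servlet URL that carries the
    // captured argument as the "parameter" query value.
    static String rewrite(String html) {
        Matcher m = JS_LINK.matcher(html);
        return m.replaceAll("<a href='www.site.com/servlet?parameter=$1'>");
    }

    public static void main(String[] args) {
        String page = "<p><a href=javascript(1);>first</a> "
                + "<a href=javascript(22);>second</a></p>";
        System.out.println(rewrite(page));
    }
}
```

Anchors that do not match the assumed shape are left untouched, so ordinary links on the same page would pass through unchanged.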

Yu