Posted to user@nutch.apache.org by Marco Crivellaro <ma...@gmail.com> on 2012/12/12 17:07:11 UTC

href links with javascript

Hi all,

I am new to Nutch and trying to find my way around crawling a single website.

The URL where crawling starts contains some hrefs with JavaScript;
these JavaScript calls contain the relative link to the page as one of
their parameters.

I don't really need to run JavaScript server side, I'd just need to replace
the JavaScript with a canonical link.

I believe I should use regex-normalize, but it doesn't seem to work. Is
this the correct approach?
Is there any way I can test what the crawled content looks like once
normalized?

Re: href links with javascript

Posted by Marco Crivellaro <ma...@gmail.com>.
Thank you for your reply, Lewis. I don't seem to be able to get ParserChecker
to output outlinks.
Can you advise on this?

On Wed, Dec 12, 2012 at 6:51 PM, Lewis John Mcgibbney <
lewis.mcgibbney@gmail.com> wrote:

>
> http://svn.apache.org/repos/asf/nutch/trunk/src/java/org/apache/nutch/parse/ParserChecker.java

Re: href links with javascript

Posted by Lewis John Mcgibbney <le...@gmail.com>.
Hi Marco,

On Wed, Dec 12, 2012 at 4:07 PM, Marco Crivellaro <ma...@gmail.com> wrote:

> I don't really need to run JavaScript server side, I'd just need to replace
> the JavaScript with a canonical link.

Sounds reasonable.

> I believe I should use regex-normalize, but it doesn't seem to work. Is
> this the correct approach?

Well what have you been doing? Have you been trying to write rules for
regex-normalize? What do they look like? What results are you getting?
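
For illustration only, the kind of rewrite being discussed can be prototyped outside Nutch before committing it to a rule. The sketch below is Python, not a Nutch configuration, and it assumes a hypothetical href shape such as javascript:openPage('/products/list.html', 'main'); the thread never shows the real markup, so the function name, the argument order, and the helper rewrite_js_href are all assumptions. It only checks that a single pattern can pull the path out of the call. Whether Nutch hands such an href to the URL normalizer at all is not confirmed anywhere in this thread.

```python
import re

# ASSUMPTION: the href looks like  javascript:someFunc('<path>', ...)
# with the page path as the first quoted argument. The real markup is
# not shown in the thread, so adjust the pattern to the actual site.
JS_HREF = re.compile(
    r"""^javascript:\s*[\w.]+\(\s*['"]([^'"]+)['"].*$""",
    re.IGNORECASE,
)


def rewrite_js_href(href: str) -> str:
    """Return the path embedded in a javascript: href.

    Hrefs that do not match the assumed shape are returned unchanged
    (apart from surrounding whitespace being stripped).
    """
    return JS_HREF.sub(r"\1", href.strip())


if __name__ == "__main__":
    # Hypothetical sample hrefs, made up for this sketch.
    samples = [
        "javascript:openPage('/products/list.html', 'main')",
        'javascript:window.open("docs/faq.html")',
        "javascript:void(0)",
        "/about.html",
    ]
    for href in samples:
        print(f"{href!r} -> {rewrite_js_href(href)!r}")
```

If the pattern behaves as intended on real hrefs from the site, the same idea could then be carried over to a regex-normalize rule. Note that Python writes the captured group as \1, and the substitution syntax of the target tool may differ, so the rule should be re-tested there.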

> Is there any way I can test what the crawled content looks like once
> normalized?

You can look at the ParserChecker tool [0]; it will give you all
outlinks for a given URL. I think this should enable you to see how
such links have been normalized on the fly.
I would also advise you to have a look at the parse-js plugin. Not
many people are keen on it, as its behaviour seems rather
unpredictable, but I would certainly give it a shot.

[0] http://svn.apache.org/repos/asf/nutch/trunk/src/java/org/apache/nutch/parse/ParserChecker.java



-- 
Lewis