Posted to user@nutch.apache.org by Michael Gang <mi...@gmail.com> on 2013/01/08 16:15:38 UTC

nutch javascript capabilities

Hi all,

From the Nutch features page
http://wiki.apache.org/nutch/Features
I understand that there is some sort of JavaScript support:

JavaScript (for extracting links only?) (parse-js)

I don't understand exactly what this means.
Say I have a link
<a onclick="do_something">
or a jQuery binding set up in the ready handler,
and that code opens a new window and shows the result of a form
submit there.
Will Nutch extract the resulting page as a link for me?

Thanks,
David

Re: nutch javascript capabilities

Posted by Lewis John Mcgibbney <le...@gmail.com>.

Hi Michael,


On Tue, Jan 8, 2013 at 7:15 AM, Michael Gang <mi...@gmail.com> wrote:

> JavaScript (for extracting links only?) (parse-js)
>

Yes, both inlinks and outlinks, if present.


>
> I don't understand exactly what this means.
> Say I have a link
> <a onclick="do_something">
> or a jQuery binding set up in the ready handler,
> and that code opens a new window and shows the result of a form
> submit there.
> Will Nutch extract the resulting page as a link for me?
>
>
The idea (taken in part from the class Javadoc) is that the parser
implements a heuristic link extractor for pure JS files and for JS
snippets embedded in (X)HTML. When JS is discovered, the parsing logic
runs a two-pass regex match to obtain links that may be useful to a
Nutch crawl. This plugin is known to act up from time to time, but
basically the two regex matches boil down to the following:
- a 'simple' string-matching pattern, which allows invalid URL characters
- an 'alternative' pattern, which limits matches to valid URL characters.

When attempting to extract URLs from literals embedded in JS, the two
patterns are run in that order.
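
To make the two-pass idea concrete, here is a minimal sketch in Python.
It is not the parse-js code: the two regexes, the function name
extract_js_links, and the example URLs are all assumptions made up for
illustration, and the real plugin's patterns differ (see its source).

```python
import re
from urllib.parse import urljoin

# Pass 1 (assumed pattern): grab quoted string literals. Anything except
# whitespace and quote characters is allowed, valid in a URL or not.
STRING_LITERAL = re.compile(r"""(["'])([^\s"']+?)\1""")

# Pass 2 (assumed pattern): keep only literals made of URL characters
# that contain at least one '/' or '.' followed by more URL characters.
URL_LIKE = re.compile(r"""^(?:https?://|/)?[\w\-.~%/?#=&:+]*[/.][\w\-.~%/?#=&:+]+$""")


def extract_js_links(js_source, base_url):
    """Return de-duplicated absolute URLs found in JS string literals."""
    links = []
    for match in STRING_LITERAL.finditer(js_source):
        literal = match.group(2)
        if URL_LIKE.match(literal):
            # Resolve relative paths against the page that held the JS.
            url = urljoin(base_url, literal)
            if url not in links:
                links.append(url)
    return links


if __name__ == "__main__":
    base = "http://example.com/app/"
    # A URL written as a literal is found ...
    print(extract_js_links("window.open('/results/page.html');", base))
    # ... but a form submit carries no URL literal, so nothing is found.
    print(extract_js_links("document.forms[0].submit();", base))
```

Because this is only pattern matching on the script text, nothing is
executed: a handler that submits a form yields no link, and a literal
that merely looks like a path (for instance a CSS selector such as
'.btn' in this sketch) can slip through as a false positive.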

hth
Lewis

Re: nutch javascript capabilities

Posted by feng lu <am...@gmail.com>.

Hi Michael

Maybe you can write a custom HTML parser plugin to parse the JavaScript.



On Tue, Jan 15, 2013 at 6:43 PM, Tejas Patil <te...@gmail.com> wrote:

> AFAIK, you cannot configure the Fetcher to use Firefox or HtmlUnit. You
> will probably have to modify the Nutch source yourself.
>
> Thanks,
> Tejas Patil



-- 
Don't Grow Old, Grow Up... :-)

Re: nutch javascript capabilities

Posted by Tejas Patil <te...@gmail.com>.

AFAIK, you cannot configure the Fetcher to use Firefox or HtmlUnit. You
will probably have to modify the Nutch source yourself.

Thanks,
Tejas Patil


On Tue, Jan 15, 2013 at 12:02 AM, Michael Gang <mi...@gmail.com> wrote:

> Hi,
>
> I understand.
> Is there a way to use another browser as the fetcher for a predefined
> set of pages?
> For example, would it be possible to tell Nutch to use Firefox or
> HtmlUnit as a fetcher?
> Many sites load content via AJAX, or submit a form on a click, so no
> real HTML snippets exist.
>
> Thanks,
> David

Re: nutch javascript capabilities

Posted by Michael Gang <mi...@gmail.com>.

Hi,

I understand.
Is there a way to use another browser as the fetcher for a predefined
set of pages?
For example, would it be possible to tell Nutch to use Firefox or
HtmlUnit as a fetcher?
Many sites load content via AJAX, or submit a form on a click, so no
real HTML snippets exist.

Thanks,
David


On Sun, Jan 13, 2013 at 8:08 PM, Lewis John Mcgibbney <
lewis.mcgibbney@gmail.com> wrote:

> This should be correct, yes.
> If you look at the plugin source, you can see the patterns it uses to
> extract links.
> Also, you can check what's in your crawldb using the readdb command.
> Hth
> Lewis

Re: nutch javascript capabilities

Posted by Lewis John Mcgibbney <le...@gmail.com>.

This should be correct, yes.
If you look at the plugin source, you can see the patterns it uses to
extract links.
Also, you can check what's in your crawldb using the readdb command.
Hth
Lewis

On Saturday, January 12, 2013, Michael Gang <mi...@gmail.com> wrote:
> Hi,
>
> So if some JavaScript actually submits a form, Nutch won't follow the
> link, because it only deals with URLs.
> Is this correct?
>
> Thanks,
> David

-- 
*Lewis*

Re: nutch javascript capabilities

Posted by Michael Gang <mi...@gmail.com>.

Hi,

So if some JavaScript actually submits a form, Nutch won't follow the
link, because it only deals with URLs.
Is this correct?

Thanks,
David

