Posted to user@nutch.apache.org by Michael Gang <mi...@gmail.com> on 2013/01/08 16:15:38 UTC
nutch javascript capabilities
Hi all,
From the features of nutch
http://wiki.apache.org/nutch/Features
I understand that there is some sort of JavaScript support:
JavaScript (for extracting links only?) (parse-js)
I don't understand what this exactly means.
Let's say I have a link
<a onclick="do_something">
or a jQuery binding in onready,
and in this code I open a new window and show the result of a form
submit there.
Will Nutch extract the resulting page as a link for me?
Thanks,
David
Re: nutch javascript capabilities
Posted by Lewis John Mcgibbney <le...@gmail.com>.
Hi Michael,
On Tue, Jan 8, 2013 at 7:15 AM, Michael Gang <mi...@gmail.com> wrote:
> JavaScript (for extracting links only?) (parse-js)
>
Yes, both in and outlinks if present.
>
> I don't understand what this exactly means.
> Let's say if i have a link
> <a onclick="do_something">
> or a jquery binding in onready
> and in this code i open a new window and show there a result of a form
> submit
> will nutch extract for me the resulting page as link ?
>
>
The idea (taken in part from the class Javadoc) is that the parser
implements a heuristic link extractor for pure JS files and for JS
snippets embedded in (X)HTML. When JS is discovered, the parsing logic
runs a two-pass regex match to obtain links that may be useful to a
Nutch crawl. This plugin is known to act up from time to time, but
basically the two regex matches boil down to the following:
- a 'simple' string-matching pattern, which allows invalid URL characters
- an 'alternative' pattern, which limits matches to valid URL characters.
When attempting to extract URLs from literals embedded in JS, the two
patterns are run in that order.
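To make the two-pass idea concrete, here is a rough sketch in Python. It is not the plugin code: the two patterns and the extract_links helper are made up for illustration, and the real ones are in the parse-js source. It only shows the general shape: first pull out every string literal, then keep the ones that look like URLs.

```python
import re

# Hypothetical patterns for illustration only; NOT the ones shipped in parse-js.
# Pass 1: any quoted string literal, with no restriction on its characters.
STRING_LITERAL = re.compile(r"""(["'])(.*?)\1""")
# Pass 2: keep only literals that start like a link and contain URL-safe characters.
URL_LIKE = re.compile(r"""^(?:https?://|/)[A-Za-z0-9\-._~:/?#\[\]@!$&()*+,;=%]+$""")


def extract_links(js_source):
    """Return the URL-like string literals found in a JavaScript snippet."""
    links = []
    for match in STRING_LITERAL.finditer(js_source):
        candidate = match.group(2)  # text between the quotes
        if URL_LIKE.match(candidate):
            links.append(candidate)
    return links


snippet = """
function do_something() {
    window.open('http://example.com/result.html');
    var label = "not a url";
    location.href = "/local/page.html";
}
"""
print(extract_links(snippet))
# prints ['http://example.com/result.html', '/local/page.html']
```

Note that an extractor of this kind only sees URLs that are written out as literals in the script. A handler such as onclick="do_something" contains no URL literal, so nothing is extracted from it.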
Hth
Lewis
Re: nutch javascript capabilities
Posted by feng lu <am...@gmail.com>.
Hi Michael
Maybe you can write a custom HTML parser plugin to parse the JavaScript.
On Tue, Jan 15, 2013 at 6:43 PM, Tejas Patil <te...@gmail.com> wrote:
> AFAIK, you cannot configure Fetcher to make use of firefox or htmlunit. You
> will perhaps have to change the nutch source by yourself.
>
> Thanks,
> Tejas Patil
>
>
> On Tue, Jan 15, 2013 at 12:02 AM, Michael Gang <michaelgang@gmail.com
> >wrote:
>
> > Hi,
> >
> > I understand.
> > Is there a way to use for a set of predefined pages another browser as
> > fetcher?
> > For example, would it be possible to say nutch that he should use firefox
> > or htmlunit as a fetcher?
> > There are many internet sites with ajax loads and where a click makes a
> > form submit, where no real html snippets exist.
> >
> > Thanks,
> > David
> >
> >
> > On Sun, Jan 13, 2013 at 8:08 PM, Lewis John Mcgibbney <
> > lewis.mcgibbney@gmail.com> wrote:
> >
> > > This should be correct yes.
> > > If you look at the plugin source you can see the patterns it uses to
> > > extract links.
> > > Also you can check what's in your crawldb using the readdb command
> > > Hth
> > > Lewis
> > >
> > > On Saturday, January 12, 2013, Michael Gang <mi...@gmail.com>
> > wrote:
> > > > Hi,
> > > >
> > > > So if there is a javascript which actually submits a form, nutch
> won't
> > > > follow the link, because it just deals with urls.
> > > > Is this correct?
> > > >
> > > > Thanks,
> > > > David
> > > >
> > > >
> > > > On Tue, Jan 8, 2013 at 5:15 PM, Michael Gang <mi...@gmail.com>
> > > wrote:
> > > >
> > > >> Hi all,
> > > >>
> > > >> From the features of nutch
> > > >> http://wiki.apache.org/nutch/Features
> > > >> i understand that there is a sort of javascript support
> > > >>
> > > >> JavaScript (for extracting links only?) (parse-js)
> > > >>
> > > >> I don't understand what this exactly means.
> > > >> Let's say if i have a link
> > > >> <a onclick="do_something">
> > > >> or a jquery binding in onready
> > > >> and in this code i open a new window and show there a result of a
> form
> > > >> submit
> > > >> will nutch extract for me the resulting page as link ?
> > > >>
> > > >> Thanks,
> > > >> David
> > > >>
> > > >>
> > > >
> > >
> > > --
> > > *Lewis*
> > >
> >
>
--
Don't Grow Old, Grow Up... :-)
Re: nutch javascript capabilities
Posted by Tejas Patil <te...@gmail.com>.
AFAIK, you cannot configure the Fetcher to use Firefox or HtmlUnit. You
would probably have to change the Nutch source yourself.
Thanks,
Tejas Patil
On Tue, Jan 15, 2013 at 12:02 AM, Michael Gang <mi...@gmail.com> wrote:
> Hi,
>
> I understand.
> Is there a way to use for a set of predefined pages another browser as
> fetcher?
> For example, would it be possible to say nutch that he should use firefox
> or htmlunit as a fetcher?
> There are many internet sites with ajax loads and where a click makes a
> form submit, where no real html snippets exist.
>
> Thanks,
> David
>
>
> On Sun, Jan 13, 2013 at 8:08 PM, Lewis John Mcgibbney <
> lewis.mcgibbney@gmail.com> wrote:
>
> > This should be correct yes.
> > If you look at the plugin source you can see the patterns it uses to
> > extract links.
> > Also you can check what's in your crawldb using the readdb command
> > Hth
> > Lewis
> >
> > On Saturday, January 12, 2013, Michael Gang <mi...@gmail.com>
> wrote:
> > > Hi,
> > >
> > > So if there is a javascript which actually submits a form, nutch won't
> > > follow the link, because it just deals with urls.
> > > Is this correct?
> > >
> > > Thanks,
> > > David
> > >
> > >
> > > On Tue, Jan 8, 2013 at 5:15 PM, Michael Gang <mi...@gmail.com>
> > wrote:
> > >
> > >> Hi all,
> > >>
> > >> From the features of nutch
> > >> http://wiki.apache.org/nutch/Features
> > >> i understand that there is a sort of javascript support
> > >>
> > >> JavaScript (for extracting links only?) (parse-js)
> > >>
> > >> I don't understand what this exactly means.
> > >> Let's say if i have a link
> > >> <a onclick="do_something">
> > >> or a jquery binding in onready
> > >> and in this code i open a new window and show there a result of a form
> > >> submit
> > >> will nutch extract for me the resulting page as link ?
> > >>
> > >> Thanks,
> > >> David
> > >>
> > >>
> > >
> >
> > --
> > *Lewis*
> >
>
Re: nutch javascript capabilities
Posted by Michael Gang <mi...@gmail.com>.
Hi,
I understand.
Is there a way to use a different browser as the fetcher for a set of
predefined pages?
For example, would it be possible to tell Nutch to use Firefox or
HtmlUnit as a fetcher?
Many sites load content via Ajax, or submit a form on a click, so no
real HTML snippets exist.
Thanks,
David
On Sun, Jan 13, 2013 at 8:08 PM, Lewis John Mcgibbney <
lewis.mcgibbney@gmail.com> wrote:
> This should be correct yes.
> If you look at the plugin source you can see the patterns it uses to
> extract links.
> Also you can check what's in your crawldb using the readdb command
> Hth
> Lewis
>
> On Saturday, January 12, 2013, Michael Gang <mi...@gmail.com> wrote:
> > Hi,
> >
> > So if there is a javascript which actually submits a form, nutch won't
> > follow the link, because it just deals with urls.
> > Is this correct?
> >
> > Thanks,
> > David
> >
> >
> > On Tue, Jan 8, 2013 at 5:15 PM, Michael Gang <mi...@gmail.com>
> wrote:
> >
> >> Hi all,
> >>
> >> From the features of nutch
> >> http://wiki.apache.org/nutch/Features
> >> i understand that there is a sort of javascript support
> >>
> >> JavaScript (for extracting links only?) (parse-js)
> >>
> >> I don't understand what this exactly means.
> >> Let's say if i have a link
> >> <a onclick="do_something">
> >> or a jquery binding in onready
> >> and in this code i open a new window and show there a result of a form
> >> submit
> >> will nutch extract for me the resulting page as link ?
> >>
> >> Thanks,
> >> David
> >>
> >>
> >
>
> --
> *Lewis*
>
Re: nutch javascript capabilities
Posted by Lewis John Mcgibbney <le...@gmail.com>.
Yes, this should be correct.
If you look at the plugin source you can see the patterns it uses to
extract links.
You can also check what's in your crawldb using the readdb command.
Hth
Lewis
On Saturday, January 12, 2013, Michael Gang <mi...@gmail.com> wrote:
> Hi,
>
> So if there is a javascript which actually submits a form, nutch won't
> follow the link, because it just deals with urls.
> Is this correct?
>
> Thanks,
> David
>
>
> On Tue, Jan 8, 2013 at 5:15 PM, Michael Gang <mi...@gmail.com>
wrote:
>
>> Hi all,
>>
>> From the features of nutch
>> http://wiki.apache.org/nutch/Features
>> i understand that there is a sort of javascript support
>>
>> JavaScript (for extracting links only?) (parse-js)
>>
>> I don't understand what this exactly means.
>> Let's say if i have a link
>> <a onclick="do_something">
>> or a jquery binding in onready
>> and in this code i open a new window and show there a result of a form
>> submit
>> will nutch extract for me the resulting page as link ?
>>
>> Thanks,
>> David
>>
>>
>
--
*Lewis*
Re: nutch javascript capabilities
Posted by Michael Gang <mi...@gmail.com>.
Hi,
So if some JavaScript actually submits a form, Nutch won't follow it,
because it only deals with URLs.
Is this correct?
Thanks,
David
On Tue, Jan 8, 2013 at 5:15 PM, Michael Gang <mi...@gmail.com> wrote:
> Hi all,
>
> From the features of nutch
> http://wiki.apache.org/nutch/Features
> i understand that there is a sort of javascript support
>
> JavaScript (for extracting links only?) (parse-js)
>
> I don't understand what this exactly means.
> Let's say if i have a link
> <a onclick="do_something">
> or a jquery binding in onready
> and in this code i open a new window and show there a result of a form
> submit
> will nutch extract for me the resulting page as link ?
>
> Thanks,
> David
>
>