Posted to user@nutch.apache.org by Patrick Kirsch <pk...@zscho.de> on 2014/06/07 12:25:13 UTC
Nutch use a Browser or phantomjs as fetcher
Hey list,
I'm sure this has been asked several times, but a quick look through the
Nutch user archive did not help, so:
Does anyone have documentation on, or experience with, using a browser
(like Chromium) or PhantomJS etc. for fetching web pages?
Because the site relies heavily on JavaScript, Nutch needs to see the
fully rendered page.
Second question: would it be better to implement this as a plugin or
natively in the fetcher class?
Regards,
Patrick
Re: Nutch use a Browser or phantomjs as fetcher
Posted by remi tassing <ta...@gmail.com>.
Hi,
I'm planning on modifying protocol-httpclient (HttpResponse.java) based on
this PhantomJSDriver tutorial:
http://assertselenium.com/2013/03/25/getting-started-with-ghostdriver-phantomjs/
I will let you know how it works out.
Remi
On Wed, Jun 11, 2014 at 5:25 AM, Julien Nioche <
lists.digitalpebble@gmail.com> wrote:
> Hi Patrick
>
> You could look at the protocol-http plugin as an example.
>
> Julien
>
>
Re: Nutch use a Browser or phantomjs as fetcher
Posted by Julien Nioche <li...@gmail.com>.
Hi Patrick
You could look at the protocol-http plugin as an example.
Julien
On 10 June 2014 10:22, Patrick Kirsch <pk...@zscho.de> wrote:
> Hey,
>
> On 06/10/2014 10:52 AM, Julien Nioche wrote:
>
>> Hi
>>
>> You can do that as a custom protocol implementation. The fetcher code
>> would stay the same but the byte content returned for a given URL would
>> be produced by phantomjs or whichever Selenium backend you'd like to use.
>>
> Do you have a documentation/wiki link or example to start from?
>
> Currently I implemented it in
> src/java/org/apache/nutch/fetcher/Fetcher.java
> as hook, if it contains "html" and "head" in the first 500 characters.
>
> Regards,
> Patrick
>
>
--
Open Source Solutions for Text Engineering
http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble
Re: Nutch use a Browser or phantomjs as fetcher
Posted by Patrick Kirsch <pk...@zscho.de>.
Hey,
On 06/10/2014 10:52 AM, Julien Nioche wrote:
> Hi
>
> You can do that as a custom protocol implementation. The fetcher code would
> stay the same but the byte content returned for a given URL would be
> produced by phantomjs or whichever Selenium backend you'd like to use.
Do you have a documentation/wiki link or an example to start from?
Currently I have implemented it in
src/java/org/apache/nutch/fetcher/Fetcher.java
as a hook that applies if the content contains "html" and "head" in the
first 500 characters.
Regards,
Patrick
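For readers following along, the check described above might look roughly like the sketch below. It is not Patrick's actual code: the class name HtmlSniffSketch, the method looksLikeHtml, the case-insensitive matching, and decoding the first 500 bytes as ISO-8859-1 (so that one byte maps to one character) are all assumptions made for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.Locale;

/** Hypothetical sketch of the "html" and "head" prefix check; not taken from Nutch. */
final class HtmlSniffSketch {

    /** Number of leading characters to inspect, as described in the message. */
    static final int SNIFF_LIMIT = 500;

    /**
     * Returns true if both "html" and "head" occur within the first
     * SNIFF_LIMIT bytes of the content (case-insensitive, which is an assumption).
     */
    static boolean looksLikeHtml(byte[] content) {
        if (content == null) {
            return false;
        }
        int len = Math.min(content.length, SNIFF_LIMIT);
        // ISO-8859-1 maps every byte to exactly one char, so 500 bytes == 500 chars.
        String prefix = new String(content, 0, len, StandardCharsets.ISO_8859_1)
                .toLowerCase(Locale.ROOT);
        return prefix.contains("html") && prefix.contains("head");
    }

    public static void main(String[] args) {
        byte[] page = "<html><head><title>x</title></head><body/></html>"
                .getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(looksLikeHtml(page));
    }
}
```

A substring check like this is cheap but can misfire (for example on plain text that merely mentions both words), so checking the Content-Type header first may be a safer trigger.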
Re: Nutch use a Browser or phantomjs as fetcher
Posted by Julien Nioche <li...@gmail.com>.
Hi
You can do that as a custom protocol implementation. The fetcher code would
stay the same but the byte content returned for a given URL would be
produced by phantomjs or whichever Selenium backend you'd like to use.
HTH
Julien
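To illustrate the idea in the message above, here is a minimal, self-contained sketch of the separation it describes: the caller asks for the content of a URL, and the bytes it gets back come from a renderer instead of a plain HTTP response. None of this is Nutch's real plugin API. RenderingProtocolSketch, FetchedContent, and the renderer function are hypothetical names, and the renderer here is a fake; in practice it would wrap PhantomJS or another Selenium backend.

```java
import java.nio.charset.StandardCharsets;
import java.util.function.Function;

/** Hypothetical stand-in for a custom protocol implementation; not Nutch's Protocol interface. */
final class RenderingProtocolSketch {

    /** Hypothetical result type: what the fetcher would receive for a URL. */
    static final class FetchedContent {
        final String url;
        final byte[] bytes;
        final String contentType;

        FetchedContent(String url, byte[] bytes, String contentType) {
            this.url = url;
            this.bytes = bytes;
            this.contentType = contentType;
        }
    }

    /** Maps a URL to its fully rendered HTML (assumed to be backed by a headless browser). */
    private final Function<String, String> renderer;

    RenderingProtocolSketch(Function<String, String> renderer) {
        this.renderer = renderer;
    }

    /** The fetcher-facing call: same shape as a normal fetch, but the bytes are rendered output. */
    FetchedContent fetch(String url) {
        String renderedHtml = renderer.apply(url);
        return new FetchedContent(url, renderedHtml.getBytes(StandardCharsets.UTF_8), "text/html");
    }

    public static void main(String[] args) {
        // Fake renderer for demonstration only; a real one would drive a browser.
        Function<String, String> fakeRenderer =
                url -> "<html><head></head><body>rendered " + url + "</body></html>";
        RenderingProtocolSketch protocol = new RenderingProtocolSketch(fakeRenderer);
        FetchedContent content = protocol.fetch("http://example.com/");
        System.out.println(new String(content.bytes, StandardCharsets.UTF_8));
    }
}
```

Keeping the renderer behind a small function like this means the calling code does not need to know which browser backend produced the bytes, which is the point of doing it at the protocol level.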
On 7 June 2014 11:35, remi tassing <ta...@gmail.com> wrote:
> I'm currently looking at those separately but an integrated option would be
> more efficient.
>
> Looking forward to hearing about any experience.
Re: Nutch use a Browser or phantomjs as fetcher
Posted by remi tassing <ta...@gmail.com>.
I'm currently looking at those separately, but an integrated option would be
more efficient.
Looking forward to hearing about any experience.
On Sat, Jun 7, 2014 at 6:25 PM, Patrick Kirsch <pk...@zscho.de> wrote:
> Hey list,
> I'm sure this issue was asked several times, but a quick look in the
> nutch user archive did not help, so:
>
> Has anyone documentation or tried to use a browser (like chromium) or
> phantomjs etc. for fetching web pages?
>
> Due to a heavily loaded javascript site, nutch needs to see the fully
> rendered page.
>
> Second question, would it be better to implement it as plugin or rather
> native in the fetcher class?
>
> Regards,
> Patrick
>
>