
Posted to user@nutch.apache.org by Patrick Kirsch <pk...@zscho.de> on 2014/06/07 12:25:13 UTC

Nutch use a Browser or phantomjs as fetcher

Hey list,
  I'm sure this has been asked several times, but a quick look through the
Nutch user archive did not help, so:

Does anyone have documentation on using a browser (like Chromium) or
PhantomJS etc. to fetch web pages, or has anyone tried it?

The target site relies heavily on JavaScript, so Nutch needs to see the
fully rendered page.

Second question: would it be better to implement this as a plugin or
natively in the fetcher class?

Regards,
  Patrick

Re: Nutch use a Browser or phantomjs as fetcher

Posted by remi tassing <ta...@gmail.com>.

Hi,

I'm planning to modify protocol-httpclient (HttpResponse.java) based on
this PhantomJSDriver tutorial:
http://assertselenium.com/2013/03/25/getting-started-with-ghostdriver-phantomjs/

I will let you know how it works out.

Remi


On Wed, Jun 11, 2014 at 5:25 AM, Julien Nioche <
lists.digitalpebble@gmail.com> wrote:

> Hi Patrick
>
> You could look at the protocol-http plugin as an example.
>
> Julien

Re: Nutch use a Browser or phantomjs as fetcher

Posted by Julien Nioche <li...@gmail.com>.

Hi Patrick

You could look at the protocol-http plugin as an example.

Julien


On 10 June 2014 10:22, Patrick Kirsch <pk...@zscho.de> wrote:

> Hey,
>
> On 06/10/2014 10:52 AM, Julien Nioche wrote:
>
>> Hi
>>
>> You can do that as a custom protocol implementation. The fetcher code
>> would stay the same, but the byte content returned for a given URL would
>> be produced by PhantomJS or whichever Selenium backend you'd like to use.
>>
> Do you have a documentation/wiki link or an example to start from?
>
> Currently I have implemented it in
> src/java/org/apache/nutch/fetcher/Fetcher.java
> as a hook that fires if the content contains "html" and "head" in the first
> 500 characters.
>
> Regards,
>  Patrick


-- 

Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble

Re: Nutch use a Browser or phantomjs as fetcher

Posted by Patrick Kirsch <pk...@zscho.de>.

Hey,
On 06/10/2014 10:52 AM, Julien Nioche wrote:
> Hi
>
> You can do that as a custom protocol implementation. The fetcher code would
> stay the same, but the byte content returned for a given URL would be
> produced by PhantomJS or whichever Selenium backend you'd like to use.
Do you have a documentation/wiki link or an example to start from?

Currently I have implemented it in
src/java/org/apache/nutch/fetcher/Fetcher.java
as a hook that fires if the content contains "html" and "head" in the first
500 characters.
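
For readers following along, the check described above might look roughly
like the sketch below. It is not the code from Fetcher.java: the class name
HtmlSniffer and the method looksLikeHtml are hypothetical, the match is
assumed to be case-insensitive, and the first 500 bytes are treated as the
first 500 characters (a simplification that holds for ASCII markup).

```java
import java.nio.charset.StandardCharsets;
import java.util.Locale;

/** Hypothetical helper for illustration only; not part of Nutch. */
final class HtmlSniffer {
    /** Number of leading characters to inspect, per the description above. */
    private static final int SNIFF_LIMIT = 500;

    /**
     * Returns true if both "html" and "head" occur within the first
     * SNIFF_LIMIT bytes of the fetched content.
     */
    static boolean looksLikeHtml(byte[] content) {
        if (content == null || content.length == 0) {
            return false;
        }
        int len = Math.min(content.length, SNIFF_LIMIT);
        // ISO-8859-1 maps every byte to exactly one char, so decoding the
        // prefix can never fail or split a multi-byte sequence.
        String prefix = new String(content, 0, len, StandardCharsets.ISO_8859_1)
                .toLowerCase(Locale.ROOT); // assumption: ignore case
        return prefix.contains("html") && prefix.contains("head");
    }
}
```

Only pages passing this test would be handed to the browser for rendering;
everything else (images, PDFs, and so on) would keep the normal fetch path.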

Regards,
  Patrick

Re: Nutch use a Browser or phantomjs as fetcher

Posted by Julien Nioche <li...@gmail.com>.

Hi

You can do that as a custom protocol implementation. The fetcher code would
stay the same, but the byte content returned for a given URL would be
produced by PhantomJS or whichever Selenium backend you'd like to use.
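
The shape of that idea can be sketched as follows. This is deliberately not
Nutch code: PageRenderer, FetchedContent and RenderingFetcher are
hypothetical stand-ins so the example stays self-contained. In an actual
plugin the fetching class would presumably implement Nutch's protocol
extension point (org.apache.nutch.protocol.Protocol), and the renderer would
wrap PhantomJS or a Selenium driver. Reporting status 200 on success is also
an assumption, since a browser driver may not expose the HTTP status.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical abstraction over a headless browser (PhantomJS, Selenium, ...). */
interface PageRenderer {
    /** Loads the URL, lets scripts run, and returns the rendered HTML. */
    String render(String url) throws Exception;
}

/** Hypothetical result holder; a real plugin would return Nutch's own types. */
final class FetchedContent {
    final String url;
    final int status;
    final byte[] bytes;
    final String contentType;

    FetchedContent(String url, int status, byte[] bytes, String contentType) {
        this.url = url;
        this.status = status;
        this.bytes = bytes;
        this.contentType = contentType;
    }
}

/** Produces the byte content for a URL from a renderer instead of a raw HTTP GET. */
final class RenderingFetcher {
    private final PageRenderer renderer;

    RenderingFetcher(PageRenderer renderer) {
        this.renderer = renderer;
    }

    FetchedContent fetch(String url) {
        try {
            String html = renderer.render(url);
            // Assumption: a successful render is reported as HTTP 200.
            return new FetchedContent(url, 200,
                    html.getBytes(StandardCharsets.UTF_8),
                    "text/html; charset=UTF-8");
        } catch (Exception e) {
            // Assumption: any renderer failure is reported as a 500 with no body.
            return new FetchedContent(url, 500, new byte[0], "text/plain");
        }
    }
}
```

Because the renderer sits behind an interface, the rest of the crawl only
ever sees bytes and a content type, which is why the fetcher itself would not
need to change.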

HTH

Julien


On 7 June 2014 11:35, remi tassing <ta...@gmail.com> wrote:

> I'm currently looking at those separately, but an integrated option would
> be more efficient.
>
> Looking forward to hearing about others' experiences.




Re: Nutch use a Browser or phantomjs as fetcher

Posted by remi tassing <ta...@gmail.com>.

I'm currently looking at those separately, but an integrated option would be
more efficient.

Looking forward to hearing about others' experiences.


On Sat, Jun 7, 2014 at 6:25 PM, Patrick Kirsch <pk...@zscho.de> wrote:

> Hey list,
>  I'm sure this has been asked several times, but a quick look through the
> Nutch user archive did not help, so:
>
> Does anyone have documentation on using a browser (like Chromium) or
> PhantomJS etc. to fetch web pages, or has anyone tried it?
>
> The target site relies heavily on JavaScript, so Nutch needs to see the
> fully rendered page.
>
> Second question: would it be better to implement this as a plugin or
> natively in the fetcher class?
>
> Regards,
>  Patrick
>
>