Posted to user@nutch.apache.org by Mohammed Omer <be...@gmail.com> on 2014/08/01 00:43:52 UTC

Re: [New Nutch Plugin] Delegate fetching to Selenium/Firefox for those jobs where you neeeeed javascript parsing

Hey Julien,

I definitely should have thanked all the work that goes into Nutch before
that (at least I said that Nutch was an awesome, world class, web crawler
though!). I get that patches are in the hands of the community, but for
someone like me or the person who submitted
https://issues.apache.org/jira/browse/NUTCH-1323 and asked for input, it
didn't seem like any existing committers were willing to vote, review it,
etc.

I'll keep that in mind though about being more vocal and active in this and
other Apache projects I use/am interested in!

Back-porting this to Nutch 1.x isn't something I plan on doing; but, if
someone using 1.x would like to make a PR for a 1.x branch, that'd be
jiggy and I'd merge it in.

Thank you,

Mo


On Thu, Jul 31, 2014 at 2:56 AM, Julien Nioche <
lists.digitalpebble@gmail.com> wrote:

> Hi,
>
> Just to add to what Seb said below :
>
> > > (from https://github.com/momer/nutch-selenium-grid-plugin#nutch-selenium)
> > > C) Not have to wait another 2 years for Nutch to patch in either the Ajax
> > > crawler hashbang workaround and then, not having to patch it to get the
> > > use case of amending the original url with the hashbang-workaround's
> > > content.
> > You are right: it's a shame for many issues and patches lying around
> > for years until they get integrated. On the other hand: everyone is
> > welcome to participate, provide and review patches, improve code and
> > documentation, etc.  There is a lot of work to do...
>
> Open source projects like Nutch rely on the participation of the community.
> Everyone is welcome to contribute in any way possible.
> If you wanted NUTCH-1323 to be committed quicker you could have helped
> review the patch, voted for it, expressed yourself on the mailing list,
> etc... Nutch is not a top-down organisation where things are decided
> entirely by PMC members but an evolutionary process where things get done
> because they are needed, get improved because they are used and so on...
> Your contribution with this plugin is a good example of this : you needed
> it, shared it and it might get improved as more people start using it.
>
> > Glad to see interest, and more importantly, people still interested in
> > nutch on the mailing list!
>
>
> Crawling is a bit of a niche activity and the traffic on the lists is never
> huge but Nutch is a very healthy project, and keeps getting better and
> better (even if some JIRA issues do not get committed very quickly). Having
> to maintain 2 versions definitely doesn't help focusing the effort.
>
> BTW what about porting your plugin to Nutch 1.x?
>
> Thanks again for sharing your work
>
> Julien
>
>
>
>
>
>
> On 31 July 2014 06:25, Mo Omer <be...@gmail.com> wrote:
>
> > Sorry for the multiple emails, I didn't see the rest of your email
> > Sebastian.
> >
> > Re httpclient - I had a total of just a few hours to hack together my
> > previous selenium stand alone plugin, and even less time to put together
> > this solution so there is looooots of stuff that can be pulled out that's
> > leftover from httpclient!
> >
> > Unfortunately lately my work queue is heavy; and, I've already moved on
> > from the project using this plugin. I'll happily look at and merge PRs, but
> > can't promise any additional refactoring or curation on my end.
> >
> > I will put together a tutorial, as I mentioned in the previous email,
> > showing
> >
> > A) What selenium is
> > B) Why it's a good compromise
> > C) Setting up Selenium Hub on Ubuntu 14.04
> > D) Setting up Selenium Node on Ubuntu 14.04
> > E) Some issues I've encountered with selenium node
> >
> > Glad to see interest, and more importantly, people still interested in
> > nutch on the mailing list!
> >
> > Thank you,
> >
> > Mo
> >
> > This message was drafted on a tiny touch screen; please forgive brevity &
> > tpyos
> >
> > > On Jul 30, 2014, at 5:22 PM, Sebastian Nagel <wastl.nagel@googlemail.com>
> > > wrote:
> > >
> > > Hi Mohammed,
> > >
> > > sounds interesting. I'll give it a try soon.
> > >
> > >> I've been using it in production for a month now; and, there are some
> > >> obvious things that need patching like
> > >> - Enabling for https pages
> > >> - It would probably be best for the overall use case to retrieve all of the
> > >> document's html, rather than just a <body> tag (if exists).
> > > At a first glance, looks like long passages of code are from protocol-http.
> > > Would be good to pull-out the parts specific to selenium and integrate
> > > them with the existing code base. This might require some refactoring.
> > >
> > >> (from https://github.com/momer/nutch-selenium-grid-plugin#nutch-selenium)
> > >> C) Not have to wait another 2 years for Nutch to patch in either the Ajax crawler
> > >> hashbang workaround and then, not having to patch it to get the use case of amending the
> > >> original url with the hashbang-workaround's content.
> > > You are right: it's a shame for many issues and patches lying around
> > > for years until they get integrated. On the other hand: everyone
> > > is welcome to participate, provide and review patches, improve code
> > > and documentation, etc.  There is a lot of work to do...
> > >
> > > Thanks for sharing the plugin,
> > > would be great to hear more from you!
> > >
> > > Sebastian
> > >
> > >
> > >
> > >> On 07/30/2014 09:26 PM, Lewis John Mcgibbney wrote:
> > >> This looks fantastic. Are you interested in bringing it into the codebase? I
> > >> think that this would be very useful to many users of Nutch and would be
> > >> extremely interested in hashing out a patch with you in order to do so.
> > >> Thanks
> > >> Lewis
> > >
> > >
> > >> On 07/29/2014 04:26 PM, Mohammed Omer wrote:
> > >> Morning everyone,
> > >>
> > >> Figured I'd share out a little plugin that delegates fetching and crawling
> > >> to a Selenium Hub/Node system, so that you can rely on Firefox to correctly
> > >> render and parse javascript as it would, and Selenium to pull out the
> > >> content you care about.
> > >>
> > >> At the moment, the plugin is set to pull just the innerHTML of the page's
> > >> <body>; as I just needed a quick and dirty fix. It's forked from my
> > >> patching of another user's previous attempt at getting Selenium standalone
> > >> working with Nutch; that was in turn a fork of httpclient. That worked
> > >> fine, but it was vulnerable to leaving lots of zombie processes hanging
> > >> around when errors occurred. I pretty much just patched it enough to get it
> > >> working - so if you end up using it and patching things / removing
> > >> unnecessaries, send them up on a PR!
> > >>
> > >> Here, we rely on Selenium Hub/Node's self-healing set-up, and just pass
> > >> requests for pages to that system, and receive html content as the response.
> > >>
> > >> I've been using it in production for a month now; and, there are some
> > >> obvious things that need patching like
> > >>
> > >> - Enabling for https pages
> > >> - It would probably be best for the overall use case to retrieve all of the
> > >> document's html, rather than just a <body> tag (if exists).
> > >>
> > >> Available at: https://github.com/momer/nutch-selenium-grid-plugin
> > >
> >
>
>
>
> --
>
> Open Source Solutions for Text Engineering
>
> http://digitalpebble.blogspot.com/
> http://www.digitalpebble.com
> http://twitter.com/digitalpebble
>
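
For readers who want the fetch pattern from the original post in concrete
terms (hand the URL to a Selenium hub, let Firefox render it, read back
either the <body> innerHTML or the whole document), here is a minimal Python
sketch. It is not the plugin's code. The hub URL, the function names and the
whole_document flag are assumptions made up for illustration, and it assumes
the Selenium 4 Python bindings are installed.

```python
def open_remote_firefox(hub_url):
    """Ask a Selenium hub for a Firefox session.

    Assumption: the third-party `selenium` package (4.x API) is installed
    and `hub_url` points at a running hub, e.g. "http://localhost:4444/wd/hub"
    (hypothetical address).
    """
    from selenium import webdriver  # imported lazily so the helper below has no hard dependency

    options = webdriver.FirefoxOptions()
    return webdriver.Remote(command_executor=hub_url, options=options)


def fetch_rendered_html(driver, url, whole_document=False):
    """Load `url` in the browser behind `driver` and return rendered HTML.

    whole_document=False mirrors what the post describes (only the innerHTML
    of <body>); whole_document=True returns the full serialized page, which
    is the improvement suggested in the thread.
    """
    driver.get(url)
    if whole_document:
        return driver.page_source
    # "tag name" is the locator strategy string behind selenium's By.TAG_NAME
    body = driver.find_element("tag name", "body")
    return body.get_attribute("innerHTML")


# Usage sketch (needs a live hub, so it is left commented out):
# driver = open_remote_firefox("http://localhost:4444/wd/hub")
# try:
#     html = fetch_rendered_html(driver, "http://example.com/", whole_document=True)
# finally:
#     driver.quit()  # always release the session so no browser is left behind
```

Calling quit() in a finally block is one simple way to avoid the leftover
browser processes the post mentions for the standalone approach; the hub/node
set-up adds its own session clean-up on top of that.
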

Re: [New Nutch Plugin] Delegate fetching to Selenium/Firefox for those jobs where you neeeeed javascript parsing

Posted by Mohammed Omer <be...@gmail.com>.
All, a little post about how I arrived at using Selenium with Nutch is at
http://soryy.com/blog/2014/ajax-javascript-enabled-parsing-apache-nutch-selenium/

I didn't have time to also go through setting up the individual components,
but I'll save that for next week.

Figured it might make for a fun read for you all, and a reminder that while
many sites promise to implement a work-around, not all of them keep that
promise!

Mo
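
As background on the "hashbang workaround" mentioned in this thread: it
presumably refers to Google's AJAX crawling scheme, under which a site that
uses #! URLs promises to serve a static snapshot when the crawler requests
the same URL with the fragment moved into an _escaped_fragment_ query
parameter. The sketch below shows only that URL rewrite, using the Python
standard library; the function name is made up for illustration and this is
not code from Nutch or from the plugin.

```python
import string
from urllib.parse import quote, urlsplit, urlunsplit

# Printable ASCII punctuation that may stay unescaped in the fragment value.
# '#', '%', '&' and '+' are excluded, so quote() percent-encodes them (along
# with spaces, control characters and non-ASCII bytes).
_SAFE = "".join(c for c in string.punctuation if c not in "#%&+")


def escaped_fragment_url(url):
    """Rewrite a #! URL into its _escaped_fragment_ form.

    URLs without a hashbang fragment are returned unchanged.
    """
    parts = urlsplit(url)
    if not parts.fragment.startswith("!"):
        return url
    state = quote(parts.fragment[1:], safe=_SAFE)
    param = "_escaped_fragment_=" + state
    # Append to an existing query string if there is one.
    query = parts.query + "&" + param if parts.query else param
    return urlunsplit((parts.scheme, parts.netloc, parts.path, query, ""))


print(escaped_fragment_url("http://example.com/page#!key=value"))
# prints http://example.com/page?_escaped_fragment_=key=value
```

A crawler that relies on this rewrite only gets useful content if the site
actually serves the snapshot, which is the broken promise that rendering the
page in a real browser sidesteps.
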

