Posted to dev@nutch.apache.org by Kamil Mroczek <ka...@elio.earth> on 2023/01/17 17:48:44 UTC

Upgrading Selenium

Hello,

I am sending a message to inquire whether I should submit a patch that
updates Selenium to the latest version. Although it is a major version
upgrade of the library, very few code changes were needed to update.

For a preview of the changes I made you can look here
<https://github.com/Elio-Earth/nutch/commit/9960f14bce0f0d6cebc406556a298a7c8c2e6b9f>.
Although not used in the code anymore (it was commented out), PhantomJS
support has been removed from Selenium in the latest version. The commit
also removes Opera since it was commented out but I can leave that in if
needed. The build and tests pass. I have been using the Chrome driver
successfully with it and would just need to run a quick test with Firefox
to make sure it works too.

I have only been using Nutch for about a month but have spent quite a bit
of time looking over different parts of the code to understand how to
configure it and change it.

Kamil

Re: Upgrading Selenium

Posted by Markus Jelsma <ma...@openindex.io>.
> There must be a way, somehow, sometime.

There isn't:
https://github.com/seleniumhq/selenium-google-code-issue-archive/issues/141

On Thu, 19 Jan 2023 at 15:23, Markus Jelsma <markus.jelsma@openindex.io>
wrote:

> > This makes some sense if you do not know anything about the URL.
> > - a HEAD request could do almost the same
> > - often one knows whether there are only HTML pages or also PDFs, zip files,
> >    and other stuff not suitable for Selenium. Could make the HEAD request
> >    optional.
>
> Ah crap, I forgot about that. With Selenium, it is still not possible to
> get the HTTP headers of the most recent request. And when requesting the
> page source for a non-text MIME-type URL, it will return either nothing or
> the source from the previous 'successful' call.
>
> Besides doing a HEAD request first, there is no neat way to work with
> non-text/html URLs as we can using HtmlUnit. That at least returns the
> headers and the raw binary data.
>
> There must be a way, somehow, sometime.
>
> Thanks,
> Markus

Re: Upgrading Selenium

Posted by Markus Jelsma <ma...@openindex.io>.
> This makes some sense if you do not know anything about the URL.
> - a HEAD request could do almost the same
> - often one knows whether there are only HTML pages or also PDFs, zip files,
>    and other stuff not suitable for Selenium. Could make the HEAD request
>    optional.

Ah crap, I forgot about that. With Selenium, it is still not possible to
get the HTTP headers of the most recent request. And when requesting the
page source for a non-text MIME-type URL, it will return either nothing or
the source from the previous 'successful' call.

Besides doing a HEAD request first, there is no neat way to work with
non-text/html URLs as we can using HtmlUnit. That at least returns the
headers and the raw binary data.
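
The HEAD-first workaround could look roughly like the sketch below. It uses
the JDK's java.net.http client (Java 11+) rather than anything from Nutch or
Selenium. The class name HeadProbe, both method names, and the ten-second
timeout are illustrative assumptions, not existing plugin code.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;
import java.util.Optional;

/** Hypothetical helper, not part of Nutch: asks a server for headers only. */
class HeadProbe {

  /** Builds a HEAD request for the URL; the timeout is an arbitrary choice. */
  static HttpRequest buildHeadRequest(String url) {
    return HttpRequest.newBuilder(URI.create(url))
        .method("HEAD", HttpRequest.BodyPublishers.noBody())
        .timeout(Duration.ofSeconds(10))
        .build();
  }

  /** Returns the Content-Type header, or empty if it is absent or the request fails. */
  static Optional<String> probeContentType(HttpClient client, String url) {
    try {
      HttpResponse<Void> response =
          client.send(buildHeadRequest(url), HttpResponse.BodyHandlers.discarding());
      return response.headers().firstValue("Content-Type");
    } catch (IOException e) {
      // Assumption: a failed probe is treated the same as a missing header.
      return Optional.empty();
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      return Optional.empty();
    }
  }

  public static void main(String[] args) {
    if (args.length == 0) {
      System.out.println("usage: java HeadProbe <url>");
      return;
    }
    // Performs a real network request against the URL given on the command line.
    System.out.println(probeContentType(HttpClient.newHttpClient(), args[0]));
  }
}
```

This still costs one extra round trip per URL, and some servers answer HEAD
differently from GET, so the result is a hint rather than a guarantee.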

There must be a way, somehow, sometime.

Thanks,
Markus

On Thu, 19 Jan 2023 at 11:38, Sebastian Nagel <wastl.nagel@googlemail.com>
wrote:

> Hi Kamil, hi Markus,
>
> Upgrading the Selenium plugin is much appreciated!
>
>  > Besides that, the plugin also needs some overhaul.
>
> Definitely.
>
>  > It currently first downloads the URL with HttpClient, and then, depending on
>  > MIME-type, it may or may not forward the URL to Selenium so it can be
>  > downloaded again.
>
> This makes some sense if you do not know anything about the URL.
> - a HEAD request could do almost the same
> - often one knows whether there are only HTML pages or also PDFs, zip files,
>    and other stuff not suitable for Selenium. Could make the HEAD request
>    optional.
>
>  > merging the lib-selenium plugin with the protocol-selenium plugin
>
> I guess lib-selenium is to share common components between protocol-selenium
> and protocol-interactiveselenium. Maybe merge all three? Or skip
> interactiveselenium for now.
>
> ~Sebastian

Re: Upgrading Selenium

Posted by Sebastian Nagel <wa...@googlemail.com>.
Hi Kamil, hi Markus,

Upgrading the Selenium plugin is much appreciated!

 > Besides that, the plugin also needs some overhaul.

Definitely.

 > It currently first downloads the URL with HttpClient, and then, depending on
 > MIME-type, it may or may not forward the URL to Selenium so it can be
 > downloaded again.

This makes some sense if you do not know anything about the URL.
- a HEAD request could do almost the same
- often one knows whether there are only HTML pages or also PDFs, zip files,
   and other stuff not suitable for Selenium. Could make the HEAD request
   optional.
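
The routing decision itself, whether it is fed by a HEAD response or by a full
download, can be sketched as a small pure function. The class MimeRouter, its
method names, and the set of "browser" MIME types are assumptions made for
illustration and do not come from the plugin. Treating a missing Content-Type
as "not for Selenium" is likewise just one possible policy.

```java
import java.util.Locale;
import java.util.Set;

/** Hypothetical helper, not part of Nutch: picks a fetcher from a Content-Type value. */
class MimeRouter {

  /** MIME types assumed to be worth rendering in a browser. */
  private static final Set<String> BROWSER_TYPES =
      Set.of("text/html", "application/xhtml+xml");

  /** Strips parameters such as "; charset=UTF-8" and lower-cases the media type. */
  static String baseType(String contentType) {
    if (contentType == null) {
      return "";
    }
    int semicolon = contentType.indexOf(';');
    String type = semicolon >= 0 ? contentType.substring(0, semicolon) : contentType;
    return type.trim().toLowerCase(Locale.ROOT);
  }

  /** True if the header value suggests an HTML page that Selenium can render. */
  static boolean shouldUseSelenium(String contentType) {
    return BROWSER_TYPES.contains(baseType(contentType));
  }

  public static void main(String[] args) {
    System.out.println(shouldUseSelenium("text/html; charset=UTF-8")); // true
    System.out.println(shouldUseSelenium("application/pdf"));          // false
  }
}
```

Making the HEAD request optional would then amount to skipping the probe and
sending every URL to Selenium when the crawl is known to contain only HTML.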

 > merging the lib-selenium plugin with the protocol-selenium plugin

I guess lib-selenium is to share common components between protocol-selenium and 
protocol-interactiveselenium. Maybe merge all three? Or skip interactiveselenium
for now.

~Sebastian

On 1/17/23 19:56, Markus Jelsma wrote:
> Hello Kamil,
> 
> Yes, the plugin needs some upgrading indeed. We use a modern version of it 
> elsewhere and it works really well, at least better than HtmlUnit.
> 
> Besides that, the plugin also needs some overhaul. It currently first downloads 
> the URL with HttpClient, and then, depending on MIME-type, it may or may not 
> forward the URL to Selenium so it can be downloaded again.
> 
> There is a lot of code in the plugin that should be removed. I would also opt 
> for merging the lib-selenium plugin with the protocol-selenium plugin. There is 
> no obvious need for having it separated.
> 
> These can be, of course, separate tasks.
> 
> Regards,
> Markus

Re: Upgrading Selenium

Posted by Kamil Mroczek <ka...@elio.earth>.
Thanks Markus. Let me submit the upgrade first to get my first commit in
and then go from there. That optimization of reducing the number of HTTP
requests will be useful, so I will look into that.

On Tue, Jan 17, 2023 at 1:56 PM Markus Jelsma <ma...@openindex.io>
wrote:

> Hello Kamil,
>
> Yes, the plugin needs some upgrading indeed. We use a modern version of it
> elsewhere and it works really well, at least better than HtmlUnit.
>
> Besides that, the plugin also needs some overhaul. It currently first
> downloads the URL with HttpClient, and then, depending on MIME-type, it may
> or may not forward the URL to Selenium so it can be downloaded again.
>
> There is a lot of code in the plugin that should be removed. I would also
> opt for merging the lib-selenium plugin with the protocol-selenium plugin.
> There is no obvious need for having it separated.
>
> These can be, of course, separate tasks.
>
> Regards,
> Markus

Re: Upgrading Selenium

Posted by Markus Jelsma <ma...@openindex.io>.
Hello Kamil,

Yes, the plugin needs some upgrading indeed. We use a modern version of it
elsewhere and it works really well, at least better than HtmlUnit.

Besides that, the plugin also needs some overhaul. It currently first
downloads the URL with HttpClient, and then, depending on MIME-type, it may
or may not forward the URL to Selenium so it can be downloaded again.

There is a lot of code in the plugin that should be removed. I would also
opt for merging the lib-selenium plugin with the protocol-selenium plugin.
There is no obvious need for having it separated.

These can be, of course, separate tasks.

Regards,
Markus

On Tue, 17 Jan 2023 at 17:49, Kamil Mroczek <ka...@elio.earth> wrote:

> Hello,
>
> I am sending a message to inquire whether I should submit a patch which
> updates selenium to the latest version. Although it is a major version
> upgrade to the library, very few code changes were needed to update.
>
> For a preview of the changes I made you can look here
> <https://github.com/Elio-Earth/nutch/commit/9960f14bce0f0d6cebc406556a298a7c8c2e6b9f>.
> Although not used in the code anymore (it was commented out), PhantomJS
> support has been removed from Selenium in the latest version. The commit
> also removes Opera since it was commented out but I can leave that in if
> needed. The build and tests pass. I have been using the Chrome driver
> successfully with it and would just need to run a quick test with Firefox
> to make sure it works too.
>
> I have only been using Nutch for about a month but have spent quite a bit
> of time looking over different parts of the code to understand how to
> configure it and change it.
>
> Kamil
>