Posted to dev@nutch.apache.org by Charan Shampur <ch...@gmail.com> on 2015/10/03 20:25:04 UTC

Unable to fetch content after integrating selenium

Hello developers,

I extended the interactive Selenium interface to write a custom handler
that automatically fills in the basic login information and enters the page.
This gives Nutch access to crawl the members area. After starting the crawl,
I could see the web browser being launched and the login page being filled
in, after which control returned to Nutch (as expected). However, to my
surprise, the Firefox driver was called again and the same page was loaded,
but this time it did not log in; instead it failed with HTTP code 403. I
have just one URL in the seed list.

I am unable to figure out what is going wrong. Any guidance would be of
great help to us.

Thanks
Charan

Re: Unable to fetch content after integrating selenium

Posted by Michael Joyce <jo...@apache.org>.

What value do you have set for interactiveselenium.handlers? If you list
multiple handlers there, each of them is going to be called on the URL. So
if you need authentication, you're going to have to do it in each handler.


-- Jimmy

On Sat, Oct 3, 2015 at 5:29 PM, crawl party <cr...@gmail.com> wrote:

> I think it's because the first time Nutch calls your custom handler, and
> the second time it calls the default handler, which doesn't do the login
> stuff.

Re: Unable to fetch content after integrating selenium

Posted by crawl party <cr...@gmail.com>.

I think it's because the first time Nutch calls your custom handler, and the
second time it calls the default handler, which doesn't do the login stuff.
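
To illustrate the behaviour described in the replies, here is a minimal,
self-contained sketch. It is not Nutch or Selenium code: FakeSession,
PageHandler, LoginHandler, PlainHandler and HandlerChainSketch are all
hypothetical names invented for this example. It assumes that every handler
listed in interactiveselenium.handlers is run against its own fresh browser
session for the same URL, which would explain why the second call gets a 403.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a browser session; not a Selenium or Nutch class.
class FakeSession {
    boolean loggedIn = false;

    // The members area answers 200 only to an authenticated session.
    int fetchMembersPage() {
        return loggedIn ? 200 : 403;
    }
}

// Hypothetical, simplified handler contract used only for this illustration.
interface PageHandler {
    int process(FakeSession session);
}

// Plays the role of the custom handler that fills in the login form.
class LoginHandler implements PageHandler {
    public int process(FakeSession session) {
        session.loggedIn = true;
        return session.fetchMembersPage();
    }
}

// Plays the role of a default handler that does no login at all.
class PlainHandler implements PageHandler {
    public int process(FakeSession session) {
        return session.fetchMembersPage();
    }
}

class HandlerChainSketch {
    // Assumption: each configured handler gets its own fresh session,
    // so a login done by one handler is not visible to the next one.
    static List<Integer> run(List<PageHandler> handlers) {
        List<Integer> statuses = new ArrayList<>();
        for (PageHandler handler : handlers) {
            statuses.add(handler.process(new FakeSession()));
        }
        return statuses;
    }

    public static void main(String[] args) {
        List<PageHandler> both = new ArrayList<>();
        both.add(new LoginHandler());
        both.add(new PlainHandler());
        System.out.println(run(both)); // prints [200, 403]
    }
}
```

Under these assumptions, listing only the login-capable handler in
interactiveselenium.handlers, or performing the login in every listed
handler, would avoid the second unauthenticated request.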
