Posted to user@nutch.apache.org by Tizy Ninan <ti...@gmail.com> on 2015/01/02 05:39:21 UTC

Re: HttpPostAuthentication

Hi,

Can somebody help with the above issue with HttpPostAuthentication
in Nutch v1.9? I have been stuck on this problem for a while. It would be
really helpful if someone could offer any insights.

Thanks,
Tizy

On Thu, Dec 18, 2014 at 11:38 AM, Tizy Ninan <ti...@gmail.com> wrote:

> Hi,
>
> Thanks for the reply.
>
> I tried applying the patch (http-client-form-authtication.patch)
> from NUTCH-827 [1] and compiled the code using ant.
>
> When I ran the crawler, it gave the following warning log message:
> "httpclient.Http: Bad auth conf file: Element <removedFormFields> not
> recognized in httpclient-auth.xml - expected <authscope>".
>
> How do I make sure that the changes in the code are reflected? It seems
> like the changes are not taking effect while crawling. What is the correct
> procedure to compile the code in the plugins?
>
> Thanks,
> Tizy
>
> On Tue, Dec 16, 2014 at 6:34 PM, remi tassing <ta...@gmail.com>
> wrote:
>>
>> I have been doing a lot of POST authentication while crawling corporate
>> stuff. Since POST methods may vary drastically between sites (e.g. typical
>> JIRA to POST+JS redirection, NTLMv2...), it's hard not to extend the
>> crawler with some additional Java.
>>
>> So what I've ended up doing is building a "handler" class for each
>> specific site; that handler knows how to send requests and fetch the
>> content. Some common response type is expected, so it looks like an
>> extension/plugin design for the protocol-httpclient plugin.
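
A minimal sketch of the per-site handler design described above, assuming
plain Java with no Nutch classes. Every name here (FetchResult, SiteHandler,
HandlerRegistry) is hypothetical and is not part of Nutch or of the
protocol-httpclient plugin; it only illustrates "one handler per site, one
common response type".

```java
import java.io.IOException;
import java.net.URI;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

/** Common response type returned by every handler (hypothetical). */
final class FetchResult {
    final int status;
    final String contentType;
    final byte[] content;

    FetchResult(int status, String contentType, byte[] content) {
        this.status = status;
        this.contentType = contentType;
        this.content = content;
    }
}

/** One implementation per site; it owns that site's login and fetch logic (hypothetical). */
interface SiteHandler {
    FetchResult fetch(URI url) throws IOException;
}

/** Picks the handler registered for a URL's host, or a fallback (hypothetical). */
final class HandlerRegistry {
    private final Map<String, SiteHandler> byHost = new LinkedHashMap<>();
    private final SiteHandler fallback;

    HandlerRegistry(SiteHandler fallback) {
        this.fallback = fallback;
    }

    /** Registers a handler for one host; host names are matched case-insensitively. */
    void register(String host, SiteHandler handler) {
        byHost.put(host.toLowerCase(Locale.ROOT), handler);
    }

    /** Returns the site-specific handler, or the fallback if none is registered. */
    SiteHandler handlerFor(URI url) {
        String host = url.getHost();
        if (host == null) {
            return fallback;
        }
        SiteHandler handler = byHost.get(host.toLowerCase(Locale.ROOT));
        return handler != null ? handler : fallback;
    }

    /** Delegates the fetch to whichever handler is responsible for the URL. */
    FetchResult fetch(URI url) throws IOException {
        return handlerFor(url).fetch(url);
    }
}
```

A site that needs a form login would get its own SiteHandler that posts the
credentials, keeps the session cookies, and then fetches the page; everything
else would go through the fallback handler.
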
>>
>> On Tue, Dec 16, 2014 at 5:46 PM, Tizy Ninan <ti...@gmail.com> wrote:
>> >
>> > Hi Talat,
>> >
>> > Thanks a lot for the reply. I will go through it and try it out.
>> >
>> > Thanks,
>> > Tizy
>> >
>> > On Tue, Dec 16, 2014 at 2:25 PM, Talat Uyarer <ta...@uyarer.com> wrote:
>> > >
>> > > Hi Tizy,
>> > >
>> > > There is some discussion of this; you can find it at NUTCH-827 [1].
>> > > IMHO we need some help. If we create this feature, it will be useful.
>> > >
>> > > Talat
>> > >
>> > > [1] https://issues.apache.org/jira/browse/NUTCH-827
>> > >
>> > > 2014-12-16 10:44 GMT+02:00 Tizy Ninan <ti...@gmail.com>:
>> > > > Hi,
>> > > >
>> > > > Thanks for the reply.
>> > > > Is there any alternative way to do this authentication? Does the
>> > > > fetcher job of Nutch accept cookies for fetching web sites from the
>> > > > same domain? Could you suggest any workaround for form-based
>> > > > authentication using Nutch?
>> > > >
>> > > > Thanks,
>> > > > Tizy
>> > > >
>> > > > On Tue, Dec 16, 2014 at 1:08 PM, Halil Ibrahim Simsek <simsekhi@gmail.com> wrote:
>> > > >>
>> > > >> Hello Tizy,
>> > > >>
>> > > >> As far as I know, the current development version of Nutch can do
>> > > >> Basic, Digest and NTLM based authentication [1]. Nutch cannot do
>> > > >> POST-based authentication that depends on cookies. BTW, there is a
>> > > >> document that is supposed to describe this feature, but as far as I
>> > > >> can see no code has been developed yet [2].
>> > > >>
>> > > >> [1] https://wiki.apache.org/nutch/HttpAuthenticationSchemes
>> > > >> [2] https://wiki.apache.org/nutch/HttpPostAuthentication
>> > > >>
>> > > >> Halil
>> > > >>
>> > > >> 2014-12-16 7:16 GMT+02:00 Tizy Ninan <ti...@gmail.com>:
>> > > >> >
>> > > >> > Hi,
>> > > >> >
>> > > >> > I am trying to develop a custom crawler in Java, using Nutch
>> > > >> > v1.9, to crawl websites that require form-based authentication.
>> > > >> > I am following the HttpPostAuthentication feature of Nutch to
>> > > >> > implement it.
>> > > >> >
>> > > >> > The login parameters required for authentication, such as the
>> > > >> > HTML form id and the login POST data (username, password), are
>> > > >> > specified as key-value pairs in a configuration file. What is
>> > > >> > required to identify the HTML login form (the id or the name of
>> > > >> > the form)? How can the form parameters be identified if the id
>> > > >> > or name of the form is not specified?
>> > > >> >
>> > > >> > I have also posted the question to the developer mailing list,
>> > > >> > but did not receive any reply. I have been stuck on this for a
>> > > >> > while. Could somebody suggest how to specify the HTML form
>> > > >> > parameters of the websites to be crawled, in order to perform
>> > > >> > form-based authentication?
>> > > >> >
>> > > >> > Thanks and Regards,
>> > > >> > Tizy
>> > > >> >
>> > > >>
>> > > >
>> > > >
>> > > > --
>> > > > Thanks and Regards,
>> > > > Tizy
>> > >
>> > >
>> > >
>> > > --
>> > > Talat UYARER
>> > > Websitesi: http://talat.uyarer.com
>> > > Twitter: http://twitter.com/talatuyarer
>> > > Linkedin: http://tr.linkedin.com/pub/talat-uyarer/10/142/304
>> > >
>> >
>> >
>> > --
>> > Thanks and Regards,
>> > Tizy
>> >
>>
>
>
> --
> Thanks and Regards,
> Tizy
>
>
>