Posted to user@nutch.apache.org by Andrzej Bialecki <ab...@getopt.org> on 2007/01/26 23:56:42 UTC

Re: Need help with form based authentication

sandeep pujar wrote:
> Greetings,
>
> I wanted to know if anybody has worked on form-based
> authentication for the Nutch crawler.
>
> Any pointers or suggestions would help.
>   

I have, without much success. Form-based authentication differs from
site to site. Most sites don't use just a plain form with
username/password; they use a wide variety of methods to check or
protect the data being sent. In extreme cases a form will embed a
challenge string, run a JavaScript-based MD5 hash, and send only the
hash. In other cases other tricks are played: setting cookies,
redirecting, running JavaScript, etc. In the end perhaps only 1 out of
50 sites used plain form authentication, and even those used
different field names on the form ... so I gave up.
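
To illustrate the "extreme case" above, here is a minimal sketch of the
kind of hashing a crawler would have to reproduce in place of the page's
JavaScript. The class name, the method names, and above all the way the
challenge and password are combined (plain concatenation) are
assumptions made for illustration; each site defines its own scheme, so
this is not a general recipe.

```java
import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper: mimics a login page that hashes the password
// together with a server-supplied challenge before submitting the form.
final class ChallengeResponse {

    // Returns the MD5 digest of the input as 32 lowercase hex characters.
    static String md5Hex(String input) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            byte[] digest = md.digest(input.getBytes(StandardCharsets.UTF_8));
            // BigInteger(1, ...) treats the bytes as unsigned; %032x keeps leading zeros.
            return String.format("%032x", new BigInteger(1, digest));
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    // ASSUMPTION: the site computes md5(challenge + password).
    // Real sites may use a different order, separators, or nested hashes.
    static String response(String challenge, String password) {
        return md5Hex(challenge + password);
    }

    public static void main(String[] args) {
        // The challenge would normally be scraped from a hidden form field.
        System.out.println(response("a1b2c3", "secret"));
    }
}
```

The crawler would send the value returned by response() in place of the
clear-text password, which is why a generic "fill in username and
password" approach fails on such sites.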

-- 
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: Need help with form based authentication

Posted by sandeep pujar <sa...@yahoo.com>.
Thank you for your reply, Andrzej.

I need to set this up for a single sign-on, form-based
authentication. In that case, what approach do you
suggest?

I was trying to put together a solution using
Apache HttpClient, very similar to this:
http://www.java-tips.org/other-api-tips/httpclient/how-to-perform-form-based-logon.html
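
For the plain-form case, a rough sketch of such a logon might look like
the following. Note that it uses the JDK's HttpURLConnection rather than
Apache HttpClient, purely to stay self-contained; the class name, the
field names "username" and "password", and the login URL are
placeholders, since the actual field names differ from site to site.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.io.UnsupportedEncodingException;
import java.net.CookieHandler;
import java.net.CookieManager;
import java.net.HttpURLConnection;
import java.net.URI;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

// Hypothetical helper for a plain username/password form.
final class FormLogin {

    // Builds an application/x-www-form-urlencoded body from the form fields.
    static String encodeForm(Map<String, String> fields) {
        StringJoiner body = new StringJoiner("&");
        try {
            for (Map.Entry<String, String> e : fields.entrySet()) {
                body.add(URLEncoder.encode(e.getKey(), "UTF-8")
                        + "=" + URLEncoder.encode(e.getValue(), "UTF-8"));
            }
        } catch (UnsupportedEncodingException ex) {
            throw new IllegalStateException("UTF-8 not supported", ex);
        }
        return body.toString();
    }

    // POSTs the form and returns the HTTP status code of the response.
    static int login(String loginUrl, Map<String, String> fields) throws IOException {
        // Install a JVM-wide cookie store so the session cookie set by the
        // login response is sent on later HttpURLConnection requests.
        CookieHandler.setDefault(new CookieManager());
        byte[] body = encodeForm(fields).getBytes(StandardCharsets.UTF_8);

        HttpURLConnection conn =
                (HttpURLConnection) URI.create(loginUrl).toURL().openConnection();
        conn.setRequestMethod("POST");
        conn.setDoOutput(true);
        conn.setRequestProperty("Content-Type", "application/x-www-form-urlencoded");
        try (OutputStream out = conn.getOutputStream()) {
            out.write(body);
        }
        return conn.getResponseCode();
    }

    public static void main(String[] args) throws IOException {
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("username", "someuser");   // placeholder field name and value
        fields.put("password", "somepass");   // placeholder field name and value
        // Placeholder URL; replace with the form's real action URL.
        System.out.println(login("http://localhost:8080/login", fields));
    }
}
```

This only covers a form that accepts its fields as-is; hidden fields,
redirects to a separate sign-on host, or script-generated values would
each need extra handling.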

Thanks!
Sandeep




 