Posted to user@nutch.apache.org by Susheel Kumar <su...@gmail.com> on 2019/07/02 16:23:05 UTC

IllegalArgumentException: No form exists: user-login-form

Hello Nutch Users,

I am a first-time Nutch user and have been trying to crawl an intranet
portal, https://pilot.mysite.sitecorp.com/user/login, using Nutch 1.15, and
I always get the "No form exists: user-login-form" error shown below.  I
tried crawling another login page, https://urs.earthdata.nasa.gov/, and did
not see this error, but for this intranet site I always get it.

I tried loading the same URL/login page using Selenium ChromeDriver, and it
does load and fill in the user id/password text boxes.

What could be wrong?  How can I troubleshoot this further?

Thanks in advance.

2019-07-02 10:36:59,152 DEBUG httpclient.HttpMethodBase - Resorting to protocol version default close connection policy
2019-07-02 10:36:59,153 DEBUG httpclient.HttpMethodBase - Should NOT close connection, using HTTP/1.1
2019-07-02 10:36:59,153 TRACE httpclient.HttpConnection - enter HttpConnection.isResponseAvailable()
2019-07-02 10:36:59,153 TRACE httpclient.HttpConnection - enter HttpConnection.releaseConnection()
2019-07-02 10:36:59,153 DEBUG httpclient.HttpConnection - Releasing connection back to connection manager.
2019-07-02 10:36:59,153 TRACE httpclient.MultiThreadedHttpConnectionManager - enter HttpConnectionManager.releaseConnection(HttpConnection)
2019-07-02 10:36:59,153 DEBUG httpclient.MultiThreadedHttpConnectionManager - Freeing connection, hostConfig=HostConfiguration[host=https://pilot.mysite.sitecorp.com]
2019-07-02 10:36:59,153 TRACE httpclient.MultiThreadedHttpConnectionManager - enter HttpConnectionManager.ConnectionPool.getHostPool(HostConfiguration)
2019-07-02 10:36:59,153 DEBUG util.IdleConnectionHandler - Adding connection at: 1562078219153
2019-07-02 10:36:59,153 DEBUG httpclient.MultiThreadedHttpConnectionManager - Notifying no-one, there are no waiting threads
2019-07-02 10:36:59,202 DEBUG httpclient.HttpFormAuthentication - No form element found with 'id' = user-login-form, trying 'name'.
2019-07-02 10:36:59,205 DEBUG httpclient.HttpFormAuthentication - No form element found with 'name' = user-login-form
2019-07-02 10:36:59,205 ERROR httpclient.Http - Failed to get protocol output
java.lang.RuntimeException: java.lang.IllegalArgumentException: No form exists: user-login-form
        at org.apache.nutch.protocol.httpclient.Http.resolveCredentials(Http.java:500)
        at org.apache.nutch.protocol.httpclient.Http.getResponse(Http.java:177)
        at org.apache.nutch.protocol.http.api.HttpBase.getProtocolOutput(HttpBase.java:320)
        at org.apache.nutch.fetcher.FetcherThread.run(FetcherThread.java:343)
Caused by: java.lang.IllegalArgumentException: No form exists: user-login-form
        at org.apache.nutch.protocol.httpclient.HttpFormAuthentication.getLoginFormParams(HttpFormAuthentication.java:219)
        at org.apache.nutch.protocol.httpclient.HttpFormAuthentication.login(HttpFormAuthentication.java:95)
        at org.apache.nutch.protocol.httpclient.Http.resolveCredentials(Http.java:498)
        ... 3 more
2019-07-02 10:36:59,209 INFO  fetcher.FetcherThread - FetcherThread 41 fetch of https://pilot.mysite.sitecorp.com/user/login failed with: java.lang.RuntimeException: java.lang.IllegalArgumentException: No form exists: user-login-form
2019-07-02 10:36:59,210 INFO  fetcher.FetcherThread - FetcherThread 41 has no more work available
2019-07-02 10:36:59,210 INFO  fetcher.FetcherThread - FetcherThread 41 -finishing thread FetcherThread, activeThreads=0
2019-07-02 10:36:59,215 INFO  mapreduce.Job - Job job_local487279790_0001 running in uber mode : false
2019-07-02 10:36:59,216 INFO  mapreduce.Job -  map 0% reduce 0%
2019-07-02 10:36:59,635 INFO  fetcher.Fetcher - -activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0, fetchQueues.getQueueCount=0
2019-07-02 10:36:59,635 INFO  fetcher.Fetcher - -activeThreads=0
2019-07-02 10:37:00,218 INFO  mapreduce.Job -  map 100% reduce 100%
2019-07-02 10:37:00,218 INFO  mapreduce.Job - Job job_local487279790_0001 completed successfully

Re: IllegalArgumentException: No form exists: user-login-form

Posted by Sebastian Nagel <wa...@googlemail.com.INVALID>.
> What could be going wrong with actual site?  How can i debug/troubleshoot
> further?

Make sure the HTML source code contains the correct <form> element:
- do not use a browser (or disable JavaScript)
- use curl or wget instead to download the page
- always be aware that the DOM tree in a web browser may look different
  than that from parsing the bare HTML page
- obviously, Dropbox isn't an appropriate host for testing and debugging
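
The check described in this list can be sketched in Python. This is only an
illustration, not Nutch's actual form lookup: it parses a page that was saved
without a browser (for example with curl or wget) and reports whether the raw
HTML contains a <form> start tag whose id or name matches. The names
FormCollector and has_form and the sample markup are made up for this sketch.

```python
from html.parser import HTMLParser


class FormCollector(HTMLParser):
    """Collect the id/name attributes of every <form> start tag in raw HTML."""

    def __init__(self):
        super().__init__()
        self.forms = []  # list of (id, name) tuples; missing attributes are None

    def handle_starttag(self, tag, attrs):
        if tag == "form":
            attr_map = dict(attrs)
            self.forms.append((attr_map.get("id"), attr_map.get("name")))


def has_form(html, form_id):
    """Return True if the raw HTML has a <form> whose id or name equals form_id."""
    collector = FormCollector()
    collector.feed(html)
    collector.close()
    return any(form_id in (fid, fname) for fid, fname in collector.forms)


# Hypothetical sample: a form tag like the one quoted in this thread,
# wrapped in a minimal page.
SAMPLE = (
    '<html><body>'
    '<form class="user-login-form" action="/user/login" method="post" '
    'id="user-login-form" accept-charset="UTF-8"></form>'
    '</body></html>'
)

if __name__ == "__main__":
    # To check a real download, read the file saved by curl/wget instead of SAMPLE.
    print(has_form(SAMPLE, "user-login-form"))
```

If this returns False for the page as downloaded by curl or wget, the form is
most likely not present in the bare HTML that a non-browser client receives.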

Good luck!

Sebastian

On 7/10/19 4:21 AM, Susheel Kumar wrote:
> It looks like when i run the html page from my local tomcat
> http://localhost:8082/mysite/ I am not getting the "no form exist" error.
> 
> What could be going wrong with actual site?  How can i debug/troubleshoot
> further?
> 
> Thanks,
> Susheel
> 


Re: IllegalArgumentException: No form exists: user-login-form

Posted by Susheel Kumar <su...@gmail.com>.
When I serve the HTML page from my local Tomcat at
http://localhost:8082/mysite/, I do not get the "No form exists" error.

What could be going wrong with the actual site?  How can I debug/troubleshoot
this further?

Thanks,
Susheel

On Tue, Jul 9, 2019 at 10:08 PM Susheel Kumar <su...@gmail.com> wrote:

> Thanks for the idea Sebastian.  Let me try that.
>
>

Re: IllegalArgumentException: No form exists: user-login-form

Posted by Susheel Kumar <su...@gmail.com>.
Thanks for the idea Sebastian.  Let me try that.


On Tue, Jul 9, 2019 at 10:15 AM Sebastian Nagel
<wa...@googlemail.com.invalid> wrote:

> Hi Ryan,
>
> there is one:
>
>   <form class="user-login-form" data-drupal-selector="user-login-form"
> action="/user/login"
> method="post" id="user-login-form" accept-charset="UTF-8">
>
> But you would need to copy the content out from dropbox, put the page on
> your own server
> and try it.
>
> Best,
> Sebastian
>

Re: IllegalArgumentException: No form exists: user-login-form

Posted by Sebastian Nagel <wa...@googlemail.com.INVALID>.
Hi Ryan,

there is one:

  <form class="user-login-form" data-drupal-selector="user-login-form" action="/user/login"
method="post" id="user-login-form" accept-charset="UTF-8">

But you would need to copy the content out of Dropbox, put the page on your own server,
and try it.

Best,
Sebastian

On 7/9/19 3:21 PM, Ryan Suarez wrote:
> ok, so the error message is quite clear.  There is no form on that link
> you provided with an id or name of 'user-login-form'.
> 


Re: IllegalArgumentException: No form exists: user-login-form

Posted by Ryan Suarez <ry...@sheridancollege.ca>.
OK, so the error message is quite clear.  There is no form on the link
you provided with an id or name of 'user-login-form'.

On Mon, 2019-07-08 at 22:39 -0400, Susheel Kumar wrote:
> Hello Sebastian,
> 
> Thanks for getting back.  Here is the Login.html link which is
> throwing no
> form exists error.
> 
> https://www.dropbox.com/s/jkts0eogarfs03j/Log%20in%20.html?dl=0
> 
> Please take a look and suggest what could be wrong when trying to
> sign in
> to this site.
> 
> Also below content of auth-configuration section of httpclient-
> auth.xml
> 
> ---
>  <credentials authMethod="formAuth"
>                 loginUrl="https://qa.mysite.sitecorp.com/user/login"
>                 loginFormId="user-login-form"
>                 loginRedirect="false">
>      <loginPostData>
>        <field name="name"
>               value="Crawler"/>
>        <field name="pass"
>               value="spid3r_us"/>
>      </loginPostData>
>      <additionalPostHeaders>
>        <field name="User-Agent"
>               value="Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_3)
> AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100
> Safari/537.36"
> />
>      </additionalPostHeaders>
>      <removedFormFields>
>        <field name="ctl00$MainContent$LoginUser$RememberMe"/>
>      </removedFormFields>
>      <loginCookie>
>        <policy>BROWSER_COMPATIBILITY</policy>
>      </loginCookie>
>    </credentials>
> 
> 
> 
> 

Re: IllegalArgumentException: No form exists: user-login-form

Posted by Susheel Kumar <su...@gmail.com>.
Hello Sebastian,

Thanks for getting back.  Here is a link to the Login.html page that is
throwing the "No form exists" error.

https://www.dropbox.com/s/jkts0eogarfs03j/Log%20in%20.html?dl=0

Please take a look and suggest what could be wrong when trying to sign in
to this site.

Below is the auth-configuration section of my httpclient-auth.xml:

---
 <credentials authMethod="formAuth"
                loginUrl="https://qa.mysite.sitecorp.com/user/login"
                loginFormId="user-login-form"
                loginRedirect="false">
     <loginPostData>
       <field name="name"
              value="Crawler"/>
       <field name="pass"
              value="spid3r_us"/>
     </loginPostData>
     <additionalPostHeaders>
       <field name="User-Agent"
              value="Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36"/>
     </additionalPostHeaders>
     <removedFormFields>
       <field name="ctl00$MainContent$LoginUser$RememberMe"/>
     </removedFormFields>
     <loginCookie>
       <policy>BROWSER_COMPATIBILITY</policy>
     </loginCookie>
   </credentials>
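
As a quick sanity check on a configuration like the one above, the attributes
can be read back with a few lines of Python. This is a sketch only: it parses
a trimmed copy of the snippet with the standard library and returns the login
URL host, the form id, and the posted field names, so they can be compared by
eye with the URL being crawled and with the form in the page source. How
Nutch itself interprets these attributes is not shown here, and the helper
name summarize_credentials is made up.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlparse

# Trimmed copy of the <credentials> element from this message (password omitted).
CONFIG = """
<credentials authMethod="formAuth"
             loginUrl="https://qa.mysite.sitecorp.com/user/login"
             loginFormId="user-login-form"
             loginRedirect="false">
  <loginPostData>
    <field name="name" value="Crawler"/>
  </loginPostData>
</credentials>
"""


def summarize_credentials(xml_text):
    """Return the login host, form id and posted field names from the snippet."""
    root = ET.fromstring(xml_text)
    fields = [f.get("name") for f in root.findall("./loginPostData/field")]
    return {
        "login_host": urlparse(root.get("loginUrl")).netloc,
        "form_id": root.get("loginFormId"),
        "post_fields": fields,
    }


if __name__ == "__main__":
    print(summarize_credentials(CONFIG))
```

In this thread the log shows fetches of pilot.mysite.sitecorp.com while the
snippet's loginUrl points at qa.mysite.sitecorp.com. Whether that difference
matters is not established here, but it is the kind of mismatch this check
makes visible.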




On Wed, Jul 3, 2019 at 10:22 AM Sebastian Nagel
<wa...@googlemail.com.invalid> wrote:

> Hi,
>
> the error message is quite clear:
>
> > 2019-07-02 10:36:59,202 DEBUG httpclient.HttpFormAuthentication - No form
> > element found with 'id' = user-login-form, trying 'name'.
> > 2019-07-02 10:36:59,205 DEBUG httpclient.HttpFormAuthentication - No form
> > element found with 'name' = user-login-form
>
> But without access to the login page content, it's nearly impossible to
> determine
> what's going wrong.
>
>
> > I tried crawling the same url/login page using Selenium Chrome Drive and
> it
> > does load and fill in the user id/pwd text boxes.
>
> Sounds like the page HTML source looks different with Selenium. Note that
> the
> protocol-httpclient does not modify the DOM tree via Javascript, it is
> derived
> from the bare HTML only.  That could be a reason why the form element is
> not found
> while it works in a browser (emulation).
>
>
> Best,
> Sebastian
>

Re: IllegalArgumentException: No form exists: user-login-form

Posted by Sebastian Nagel <wa...@googlemail.com.INVALID>.
Hi,

the error message is quite clear:

> 2019-07-02 10:36:59,202 DEBUG httpclient.HttpFormAuthentication - No form
> element found with 'id' = user-login-form, trying 'name'.
> 2019-07-02 10:36:59,205 DEBUG httpclient.HttpFormAuthentication - No form
> element found with 'name' = user-login-form

But without access to the login page content, it's nearly impossible to determine
what's going wrong.


> I tried crawling the same url/login page using Selenium Chrome Drive and it
> does load and fill in the user id/pwd text boxes.

Sounds like the page's HTML source looks different with Selenium. Note that
protocol-httpclient does not execute JavaScript to modify the DOM tree; the tree
is derived from the bare HTML only.  That could be why the form element is not
found while it works in a browser (emulation).
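
To make the point above concrete, here is a small Python sketch (an
illustration only; it does not use Nutch code). The hypothetical page below
builds its login form in a script, so a parser that only reads the bare HTML
sees no <form> element, whereas a browser would create one after running the
script. The page contents and the function name static_form_ids are invented
for this example.

```python
from html.parser import HTMLParser


class FormIdParser(HTMLParser):
    """Record the id of each <form> element present in the static markup."""

    def __init__(self):
        super().__init__()
        self.form_ids = []

    def handle_starttag(self, tag, attrs):
        if tag == "form":
            self.form_ids.append(dict(attrs).get("id"))


def static_form_ids(html):
    """Return the ids of all <form> elements found without running any script."""
    parser = FormIdParser()
    parser.feed(html)
    parser.close()
    return parser.form_ids


# Hypothetical page: the form only exists after a browser runs the script.
SCRIPTED_PAGE = """
<html><body>
<div id="login-root"></div>
<script>
  var f = document.createElement("form");
  f.id = "user-login-form";
  document.getElementById("login-root").appendChild(f);
</script>
</body></html>
"""

# Hypothetical page: the form is part of the HTML sent by the server.
STATIC_PAGE = '<html><body><form id="user-login-form" method="post"></form></body></html>'

if __name__ == "__main__":
    print(static_form_ids(SCRIPTED_PAGE))
    print(static_form_ids(STATIC_PAGE))
```

Only the second page yields a form id here, which mirrors the situation where
a lookup on bare HTML fails while a browser-driven tool such as Selenium
succeeds.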


Best,
Sebastian

Re: IllegalArgumentException: No form exists: user-login-form

Posted by Susheel Kumar <su...@gmail.com>.
Any insight into this error?

On Tue, Jul 2, 2019 at 12:23 PM Susheel Kumar <su...@gmail.com> wrote:

> Hello Nutch Users,
>
> I am a first time Nutch user and been trying to crawl an intranet portal *https://pilot.mysite.sitecorp.com/user/login
> <https://pilot.mysite.sitecorp.com/user/login>*  using Nutch 1.15 and I
> am always getting below "No form exists: user-login-form" error.  I tried
> crawling other login page like https://urs.earthdata.nasa.gov/ and do not
> see such error but for this intranet site I am always getting this error.
>
> I tried crawling the same url/login page using Selenium Chrome Drive and
> it does load and fill in the user id/pwd text boxes.
>
> What could be wrong.  How can i further troubleshoot this?
>
> Thanks in advance.