Posted to user@nutch.apache.org by karthik085 <ka...@gmail.com> on 2008/07/14 23:49:17 UTC

Bypass Validation

Hi.

I am trying to crawl a page using Nutch. That page sits behind a
validator (Struts), i.e. in order to get to the page, a button needs to be
clicked. Is there any way this can be bypassed so the web crawler can get to
the page without clicking this button?

Code:
<form name="loginForm" method="post" action="/check.do">
  <input type="hidden" name="forward" value="target_page">
  <input type="submit" name="org.apache.struts.taglib.html.CANCEL"
         value="Continue" onclick="bCancel=true;">
</form>

Any help is appreciated. Thanks.
-- 
View this message in context: http://www.nabble.com/Bypass-Validation-tp18453973p18453973.html
Sent from the Nutch - User mailing list archive at Nabble.com.


RE: Bypass Validation

Posted by Patrick Markiewicz <pm...@sim-gtech.com>.
Hi,
   Can you build a URL that gets past that page without clicking the
button?  I.e., can you type something like
http://form.example.com/check.do?forward=target_page&org.apache.struts.taglib.html.CANCEL=Continue
into a web browser and see the page that the button would normally
produce?
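
For what it's worth, here is a minimal Python sketch of how such a URL could be assembled from the form's fields. The helper name build_bypass_url is made up for illustration, and form.example.com is only a placeholder host. The form itself uses method="post", so whether the server also accepts the same fields over GET is an assumption you would have to check in a browser.

```python
from urllib.parse import urlencode, urljoin


def build_bypass_url(base_url, action, fields):
    """Return a GET URL that carries the form's fields as a query string.

    Hypothetical helper: it only builds the string, it does not fetch anything.
    """
    return urljoin(base_url, action) + "?" + urlencode(fields)


# Placeholder host; the action, field names and values are copied from the posted form.
url = build_bypass_url(
    "http://form.example.com/",
    "/check.do",
    {
        "forward": "target_page",
        "org.apache.struts.taglib.html.CANCEL": "Continue",
    },
)
print(url)
```

If the resulting URL shows the target page in a browser, it could go straight into the crawler's seed list in place of the page that holds the button.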

I'm no Nutch expert, but if the page behind the button depends on
cookies, you may need to use the http-client plugin instead of the http
plugin.  The problem with the http-client plugin is that all of your
original URLs need to be escaped.  I.e., in your urls list, you need:
http%3A//www.google.com
instead of
http://www.google.com
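
As a rough illustration of the escaped form shown above, this Python sketch percent-encodes the colon while leaving the slashes alone. The helper name escape_seed_url is made up, and whether a given Nutch version actually needs seed URLs in this form is an assumption carried over from this message, not something the sketch verifies.

```python
from urllib.parse import quote, unquote


def escape_seed_url(url):
    """Percent-encode a URL, keeping '/' unescaped (hypothetical helper)."""
    return quote(url, safe="/")


escaped = escape_seed_url("http://www.google.com")
print(escaped)           # the escaped form, as in the example above
print(unquote(escaped))  # round-trips back to the original URL
```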

Patrick