Posted to user@nutch.apache.org by Koe Black <ko...@yahoo.com> on 2007/08/30 17:10:19 UTC

ability to crawl password protected site

Hello,

I was not able to find an answer to these questions.

Can the Nutch crawler be configured to:
1. Crawl a password-protected site? That is, can I configure
the crawler so that it knows what user name and password to
enter on the protected page before crawling the protected
part of the site?
2. Crawl a site with a JavaScript agreement button? Can I
configure the crawler to click the agreement/disclaimer
button before it starts crawling a site that requires
accepting an agreement/disclaimer?

For example, I know that Google appliance boxes can do
that.

Thank you


       

Re: opensearch error nutch 9

Posted by Bud Witney <wi...@osu.edu>.
Found the following hack to fix it:

http://forum.java.sun.com/thread.jspa?tstart=30&forumID=34&threadID=542044&trange=15

-Bud


On Aug 30, 2007, at 3:40 PM, Brian Ulicny wrote:

> Yes, you need to get a copy of the xalan jar and put it in your lib
> directory.
>
> BU
> On Thu, 30 Aug 2007 15:23:49 -0400, "Bud Witney" <wi...@osu.edu>
> said:
>> I just updated Fedora 6 to the latest versions and now
>> get the following when using OpenSearch RSS:
>>
>> HTTP Status 500 -
>>
>> type Exception report
>>
>> message
>>
>> description The server encountered an internal error () that
>> prevented it from fulfilling this request.
>>
>> exception
>>
>> javax.servlet.ServletException: Servlet execution threw an exception
>>
>> root cause
>>
>> javax.xml.transform.TransformerFactoryConfigurationError: Provider org.apache.xalan.processor.TransformerFactoryImpl not found
>> 	javax.xml.transform.TransformerFactory.newInstance(Unknown Source)
>> 	org.apache.nutch.searcher.OpenSearchServlet.doGet(OpenSearchServlet.java:250)
>> 	javax.servlet.http.HttpServlet.service(HttpServlet.java:690)
>> 	javax.servlet.http.HttpServlet.service(HttpServlet.java:803)
>>
>> note The full stack trace of the root cause is available in the
>> Apache Tomcat/5.5.23 logs.
>>
>>
>> ***** Any ideas how to fix *********
>>
>> -Bud
>>
>>
> -- 
>   Brian Ulicny
>   bulicny at alum dot mit dot edu
>   home: 781-721-5746
>   fax: 360-361-5746


Re: opensearch error nutch 9

Posted by Brian Ulicny <bu...@alum.mit.edu>.
Yes, you need to get a copy of the xalan jar and put it in your lib
directory.
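
To check whether the JVM can load an XSLT TransformerFactory at all, something like the following may help. It is only a minimal sketch: the class name TransformerCheck and the method describeFactory are made up for illustration and are not part of Nutch. It calls the same TransformerFactory.newInstance() that appears in the stack trace and reports which implementation was picked up, or the error message if none could be loaded.

```java
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.TransformerFactoryConfigurationError;

// Hypothetical helper for illustration; not part of Nutch.
class TransformerCheck {

    // Returns the class name of the TransformerFactory the JVM resolves,
    // or a short description of the failure if no provider can be loaded.
    static String describeFactory() {
        try {
            return TransformerFactory.newInstance().getClass().getName();
        } catch (TransformerFactoryConfigurationError e) {
            return "unavailable: " + e.getMessage();
        }
    }

    public static void main(String[] args) {
        // The javax.xml.transform.TransformerFactory system property, if set,
        // names the provider class that newInstance() will try to load first.
        System.out.println("configured provider: "
                + System.getProperty("javax.xml.transform.TransformerFactory"));
        System.out.println("resolved factory:    " + describeFactory());
    }
}
```

Run it with the same JVM that Tomcat uses. Keep in mind that a webapp has its own class loader, so a result from the command line is only a hint about what the servlet sees.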

BU
On Thu, 30 Aug 2007 15:23:49 -0400, "Bud Witney" <wi...@osu.edu>
said:
> I just updated Fedora 6 to the latest versions and now
> get the following when using OpenSearch RSS:
> 
> HTTP Status 500 -
> 
> type Exception report
> 
> message
> 
> description The server encountered an internal error () that  
> prevented it from fulfilling this request.
> 
> exception
> 
> javax.servlet.ServletException: Servlet execution threw an exception
> 
> root cause
> 
> javax.xml.transform.TransformerFactoryConfigurationError: Provider org.apache.xalan.processor.TransformerFactoryImpl not found
> 	javax.xml.transform.TransformerFactory.newInstance(Unknown Source)
> 	org.apache.nutch.searcher.OpenSearchServlet.doGet(OpenSearchServlet.java:250)
> 	javax.servlet.http.HttpServlet.service(HttpServlet.java:690)
> 	javax.servlet.http.HttpServlet.service(HttpServlet.java:803)
> 
> note The full stack trace of the root cause is available in the  
> Apache Tomcat/5.5.23 logs.
> 
> 
> ***** Any ideas how to fix *********
> 
> -Bud
> 
> 
-- 
  Brian Ulicny
  bulicny at alum dot mit dot edu
  home: 781-721-5746
  fax: 360-361-5746



opensearch error nutch 9

Posted by Bud Witney <wi...@osu.edu>.
I just updated Fedora 6 to the latest versions and now
get the following when using OpenSearch RSS:

HTTP Status 500 -

type Exception report

message

description The server encountered an internal error () that  
prevented it from fulfilling this request.

exception

javax.servlet.ServletException: Servlet execution threw an exception

root cause

javax.xml.transform.TransformerFactoryConfigurationError: Provider org.apache.xalan.processor.TransformerFactoryImpl not found
	javax.xml.transform.TransformerFactory.newInstance(Unknown Source)
	org.apache.nutch.searcher.OpenSearchServlet.doGet(OpenSearchServlet.java:250)
	javax.servlet.http.HttpServlet.service(HttpServlet.java:690)
	javax.servlet.http.HttpServlet.service(HttpServlet.java:803)

note The full stack trace of the root cause is available in the  
Apache Tomcat/5.5.23 logs.


***** Any ideas how to fix *********

-Bud