Posted to dev@nutch.apache.org by eyal edri <ey...@gmail.com> on 2007/11/13 16:30:14 UTC
Need help in updating url in runtime in [Fetcher.java]
Hello,
I'm trying to fix a Nutch "bug" in the fetcher.
I don't know if you've noticed, but if you try to fetch sites that don't
have the "www" prefix in their URL, such as http://inter.edu,
and these sites didn't register the domain http://inter.edu, but only
http://www.inter.edu, the Nutch fetch will fail. (I know it's not really a bug, but I
would like Nutch to handle this case.)
So I've written a hardcoded snippet in Fetcher.java:
public void run() {
    synchronized (Fetcher.this) {activeThreads++;} // count threads
    ...
    ... some code here....
    redirecting = false;
    Protocol protocol = this.protocolFactory.getProtocol(url.toString());
    ProtocolOutput output = protocol.getProtocolOutput(url, datum);
    ProtocolStatus status = output.getStatus();
    Content content = output.getContent();
    ParseStatus pstatus = null;

    // here comes my code (the fetcher has thrown an UnknownHostException)
    if (status.getCode() == ProtocolStatus.EXCEPTION) {
        String urlPrefix = "";
        String newurl = url.toString();
        LOG.info("this is newurl: " + newurl);
        Pattern urlRegex = Pattern.compile("http://([^\\.]*)\\.(.*)$");
        Matcher urlMatcher = urlRegex.matcher(url.toString());
        if (urlMatcher.find())
            urlPrefix = urlMatcher.group(1);
        newurl = newurl.replaceAll("http://" + urlPrefix, "http://www." + urlPrefix);
        // now I've come to the point that I have the new fixed url with the "www" prefix,
        // but I don't know how to update 'url', which is of type Text, without
        // damaging the rest of the data it holds (like contentType)
        // here is the command I don't know
        url.set(newurl); // ???
        status.setCode(ProtocolStatus.SUCCESS);
    }
    switch(status.getCode()) {
    case ProtocolStatus.SUCCESS: // got a page
        pstatus = output(url, datum, content, status, CrawlDatum.S*
    ..... some more code....
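
To make the URL rewrite easier to test on its own, outside the fetcher, here is a
minimal standalone sketch of the "add www" step. The class name WwwPrefixer and the
method addWww are hypothetical names used only for illustration; they are not part of
Nutch. It assumes plain http/https URLs without user info, and it leaves a URL alone
when the host already starts with "www.".

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Nutch: prepends "www." to the host of a URL.
final class WwwPrefixer {
    // Group 1: the scheme and "://". Group 2: the host, up to the first '/', ':', '?' or '#'.
    private static final Pattern HOST =
        Pattern.compile("^(https?://)([^/:?#]+)", Pattern.CASE_INSENSITIVE);

    static String addWww(String url) {
        Matcher m = HOST.matcher(url);
        if (!m.find()) {
            return url; // not an http(s) URL, leave it untouched
        }
        String host = m.group(2);
        if (host.regionMatches(true, 0, "www.", 0, 4)) {
            return url; // already has the "www." prefix
        }
        // Rebuild: scheme + "www." + host + everything after the host (port, path, query).
        return m.group(1) + "www." + host + url.substring(m.end());
    }

    public static void main(String[] args) {
        System.out.println(addWww("http://inter.edu"));
        System.out.println(addWww("http://www.inter.edu/index.html"));
    }
}
```

Unlike the replaceAll() call in my snippet, this does not treat the host as a regular
expression, so dots in the host name are matched literally.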
Thank you,
Eyal.
--
Eyal Edri