Posted to dev@nutch.apache.org by eyal edri <ey...@gmail.com> on 2007/11/13 16:30:14 UTC

Need help updating the URL at runtime in [Fetcher.java]

Hello,

I'm trying to fix a Nutch "bug" in the fetcher.
I don't know if you've noticed, but if you try to fetch a site whose URL lacks
the "www" prefix, such as http://inter.edu, and the site registered only
http://www.inter.edu and not the bare domain http://inter.edu, the Nutch fetch
will fail. (I know it's not really a bug, but I would like Nutch to retry with
the "www" prefix in that case.)

So I've written a hardcoded snippet in Fetcher.java:

 public void run() {
      synchronized (Fetcher.this) {activeThreads++;} // count threads
      ...
      ... some code here....
              redirecting = false;
              Protocol protocol = this.protocolFactory.getProtocol(url.toString());
              ProtocolOutput output = protocol.getProtocolOutput(url, datum);
              ProtocolStatus status = output.getStatus();
              Content content = output.getContent();
              ParseStatus pstatus = null;

              // Here comes my code (the fetcher has thrown an UnknownHostException)
              if (status.getCode() == ProtocolStatus.EXCEPTION)
              {
                  String urlPrefix = "";
                  String newurl = url.toString();
                  LOG.info("this is newurl: " + newurl);
                  // Capture the first host label, i.e. the text between "http://" and the first dot
                  Pattern urlRegex = Pattern.compile("http://([^\\.]*)\\.(.*)$");
                  Matcher urlMatcher = urlRegex.matcher(url.toString());
                  if (urlMatcher.find())
                     urlPrefix = urlMatcher.group(1);
                  newurl = newurl.replaceAll("http://" + urlPrefix, "http://www." + urlPrefix);

                  // Now I have the new, fixed URL with the "www" prefix, but I don't
                  // know how to update 'url', which is of type Text, without damaging
                  // the rest of the data it holds (like contentType).
                  // This is the call I'm unsure about:
                  url.set(newurl); // ???

                  status.setCode(ProtocolStatus.SUCCESS);
              }

              switch(status.getCode()) {

              case ProtocolStatus.SUCCESS:        // got a page

              pstatus = output(url, datum, content, status, CrawlDatum.S*

             ..... some more code....
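
To make the prefix rewriting easier to check outside the fetcher, here is a
minimal standalone sketch of the same idea. It is not Nutch code: the class
name WwwPrefixer and the method addWwwPrefix are hypothetical names chosen for
illustration, and it uses java.net.URI instead of the regex above, so that only
the host part is touched and URLs that already start with "www." are left
alone.

```java
import java.net.URI;
import java.net.URISyntaxException;

// Hypothetical helper, not part of Nutch: prepends "www." to the host of a URL.
public class WwwPrefixer {

    // Returns the URL with "www." added to its host. The input is returned
    // unchanged if it has no host, already starts with "www.", or cannot be parsed.
    public static String addWwwPrefix(String urlString) {
        try {
            URI u = new URI(urlString);
            String host = u.getHost();
            if (host == null || host.startsWith("www.")) {
                return urlString;
            }
            // Rebuild the URI with only the host changed.
            URI fixed = new URI(u.getScheme(), u.getUserInfo(), "www." + host,
                                u.getPort(), u.getPath(), u.getQuery(), u.getFragment());
            return fixed.toString();
        } catch (URISyntaxException e) {
            return urlString;
        }
    }

    public static void main(String[] args) {
        System.out.println(addWwwPrefix("http://inter.edu"));
    }
}
```

The result of addWwwPrefix(url.toString()) would then take the place of newurl
in the snippet above; how to get it back into 'url' is still the open question.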

Thank you,

Eyal.

-- 
Eyal Edri