Posted to user@nutch.apache.org by Mohini Padhye <mp...@internap.com> on 2005/10/25 04:41:35 UTC

Problem crawling a site when url contains spaces

Hi,

I get several of the following errors while doing an intranet crawl....

fetch of http://www.mysite.com/About Site/Case Studies/page1419.html failed with:
net.nutch.protocol.http.HttpError: HTTP Error: 400

The reason for this is that the URL contains spaces (which should be
encoded as %20 in the URL). What is the solution for crawling a
website whose URLs contain spaces?

Can I add a regex to the regex-urlfilter.txt file to handle this?

Thanks,

Mohini    


Re: Problem crawling a site when url contains spaces

Posted by Ben <ne...@gmail.com>.
Hi

I had this problem and solved it by applying the escapeWhitespace
method from Heritrix 1.4.0 to Nutch's HttpResponse class. Here is the
full code for the method:

  /**
   * Borrowed from Heritrix 1.4.0, with MutableString changed to
   * StringBuilder. If you're using JDK 1.4, change StringBuilder
   * to StringBuffer.
   **/
  protected String escapeWhitespace(String uri) {
      // Just write a new string anyways.  The perl '\s' is not
      // as inclusive as the Character.isWhitespace so there are
      // whitespace characters we could miss.  So, rather than
      // write some awkward regex, just go through the string
      // a character at a time.  Only create buffer first time
      // we find a space.
      StringBuilder buffer = null;
      for (int i = 0; i < uri.length(); i++) {
          char c = uri.charAt(i);
          if (Character.isWhitespace(c)) {
              if (buffer == null) {
                  buffer = new StringBuilder(uri.length() +
                      2 /*If space, two extra characters (at least)*/);
                  buffer.append(uri.substring(0, i));
              }
              buffer.append("%");
              String hexStr = Integer.toHexString(c);
              if ((hexStr.length() % 2) > 0) {
                  buffer.append("0");
              }
              buffer.append(hexStr);

          } else if (buffer != null) {
              buffer.append(c);
          }
      }
      return (buffer != null) ? buffer.toString() : uri;
  }

In HttpResponse.java, replace:

GetMethod get = new GetMethod(this.orig);

with:

GetMethod get = new GetMethod(escapeWhitespace(this.orig));
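For anyone who wants to sanity-check the method outside of Nutch, here
is a minimal standalone sketch (the EscapeWhitespaceDemo class name and
the main method are illustrative, not part of Nutch; the method body is
the same code as above):

```java
// Standalone demo of the escapeWhitespace method shown above.
public class EscapeWhitespaceDemo {

    // Identical logic to the method above, copied here so this
    // file compiles on its own.
    static String escapeWhitespace(String uri) {
        StringBuilder buffer = null;
        for (int i = 0; i < uri.length(); i++) {
            char c = uri.charAt(i);
            if (Character.isWhitespace(c)) {
                if (buffer == null) {
                    // Lazily create the buffer on the first space found,
                    // copying over the already-scanned prefix.
                    buffer = new StringBuilder(uri.length() + 2);
                    buffer.append(uri.substring(0, i));
                }
                buffer.append("%");
                String hexStr = Integer.toHexString(c);
                if ((hexStr.length() % 2) > 0) {
                    buffer.append("0"); // pad to two hex digits, e.g. tab -> %09
                }
                buffer.append(hexStr);
            } else if (buffer != null) {
                buffer.append(c);
            }
        }
        // If no whitespace was found, return the original string untouched.
        return (buffer != null) ? buffer.toString() : uri;
    }

    public static void main(String[] args) {
        String raw = "http://www.mysite.com/About Site/Case Studies/page1419.html";
        System.out.println(escapeWhitespace(raw));
        // prints http://www.mysite.com/About%20Site/Case%20Studies/page1419.html
    }
}
```

Each space (0x20) comes out as %20, which is exactly the encoding the
original error message was missing.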


HTH
-Ben
