Posted to user@nutch.apache.org by Mohini Padhye <mp...@internap.com> on 2005/10/25 04:41:35 UTC
Problem crawling a site when url contains spaces
Hi,
I get several of the following errors while doing an intranet crawl....
fetch of http://www.mysite.com/About Site/Case Studies/page1419.html failed with:
net.nutch.protocol.http.HttpError: HTTP Error: 400
The reason for this is that the URL contains spaces (which should be
represented as %20 in an escaped URL). What is the solution for crawling a
website whose URLs contain spaces?
Can I add some regex for this problem in regex-urlfilter.txt file?
Thanks,
Mohini
Re: Problem crawling a site when url contains spaces
Posted by Ben <ne...@gmail.com>.
Hi
I had this problem and I applied the escapeWhitespace method from
Heritrix 1.4.0 to the HttpResponse class. Here is the full code for
the method:
/**
 * Borrowed from Heritrix 1.4.0 and changed MutableString to StringBuilder.
 * If you're using JDK 1.4, change StringBuilder to StringBuffer.
 **/
protected String escapeWhitespace(String uri) {
    // Just write a new string anyways. The perl '\s' is not
    // as inclusive as Character.isWhitespace, so there are
    // whitespace characters we could miss. So, rather than
    // write some awkward regex, just go through the string
    // a character at a time. Only create the buffer the first
    // time we find a whitespace character.
    StringBuilder buffer = null;
    for (int i = 0; i < uri.length(); i++) {
        char c = uri.charAt(i);
        if (Character.isWhitespace(c)) {
            if (buffer == null) {
                buffer = new StringBuilder(uri.length() +
                    2 /* If space, two extra characters (at least) */);
                buffer.append(uri.substring(0, i));
            }
            buffer.append("%");
            String hexStr = Integer.toHexString(c);
            if ((hexStr.length() % 2) > 0) {
                buffer.append("0");
            }
            buffer.append(hexStr);
        } else {
            if (buffer != null) {
                buffer.append(c);
            }
        }
    }
    return (buffer != null) ? buffer.toString() : uri;
}
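To see what the method produces on a URL like the failing one from the original post, here is a minimal standalone sketch (the demo class name is an invention; the escaping logic is the same as the method above):

```java
public class EscapeWhitespaceDemo {
    // Percent-encode any character that Character.isWhitespace()
    // matches, padding the hex value to two digits (' ' -> %20,
    // '\t' -> %09). Non-whitespace characters pass through untouched.
    static String escapeWhitespace(String uri) {
        StringBuilder buffer = null;
        for (int i = 0; i < uri.length(); i++) {
            char c = uri.charAt(i);
            if (Character.isWhitespace(c)) {
                if (buffer == null) {
                    buffer = new StringBuilder(uri.length() + 2);
                    buffer.append(uri.substring(0, i));
                }
                buffer.append("%");
                String hexStr = Integer.toHexString(c);
                if ((hexStr.length() % 2) > 0) {
                    buffer.append("0");
                }
                buffer.append(hexStr);
            } else if (buffer != null) {
                buffer.append(c);
            }
        }
        return (buffer != null) ? buffer.toString() : uri;
    }

    public static void main(String[] args) {
        // Prints the URL with each space escaped as %20:
        // http://www.mysite.com/About%20Site/Case%20Studies/page1419.html
        System.out.println(escapeWhitespace(
            "http://www.mysite.com/About Site/Case Studies/page1419.html"));
    }
}
```

Note this only escapes whitespace; other reserved characters in the URL are left alone, which is exactly the narrow fix needed for the 400 error above.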
In HttpResponse.java, replace:
GetMethod get = new GetMethod(this.orig);
with:
GetMethod get = new GetMethod(escapeWhitespace(this.orig));
HTH
-Ben
On 10/25/05, Mohini Padhye <mp...@internap.com> wrote:
> Hi,
>
> I get several of the following errors while doing an intranet crawl....
>
> fetch of http://www.mysite.com/About Site/Case Studies/page1419.html failed with:
> net.nutch.protocol.http.HttpError: HTTP Error: 400
>
> The reason for this is that the URL contains spaces (which should be
> represented as %20 in an escaped URL). What is the solution for crawling a
> website whose URLs contain spaces?
>
> Can I add some regex for this problem in regex-urlfilter.txt file?
>
> Thanks,
>
> Mohini