Posted to user@nutch.apache.org by Sébastien LE CALLONNEC <sl...@yahoo.ie> on 2005/03/03 21:59:22 UTC
GZipped pages
Hi list,
(Sorry if this isn't the proper list to post this)
I was experimenting with Nutch (the version I checked out from
Subversion this morning) and it failed to index a site I tried:
apparently the page could not be un-gzipped, for some reason. The log
I was getting was:
050303 211020 fetched 4471 bytes of compressed content (expanded to 0 bytes) from http://www.lesauna.net/index.php3
I played with GZIPUtils a wee bit and realised that the
unzipBestEffort(byte[] in, int sizeLimit) method contains an empty
catch block. When I added some logging to it, I got the following
exception:
java.lang.IndexOutOfBoundsException
    at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:89)
    at org.apache.nutch.util.GZIPUtils.unzipBestEffort(GZIPUtils.java:70)
    at org.apache.nutch.protocol.http.HttpResponse.<init>(HttpResponse.java:166)
    at org.apache.nutch.protocol.http.Http.getContent(Http.java:186)
    at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:120)
I think the problem occurs when:
1. the page is gzipped, and
2. http.content.limit is set to -1.
In that case (sizeLimit - written) is negative in this method, so the
write() call throws the exception.
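
To illustrate the failure mode described above, here is a minimal sketch. It assumes the original loop compares (written + size) against sizeLimit and, when over, writes (sizeLimit - written) bytes; the class and method names are made up for the example and are not from the Nutch sources.

```java
import java.io.ByteArrayOutputStream;

final class NegativeLimitDemo {

    /**
     * Mimics one pass of a truncating copy loop (assumed shape, see above).
     * Returns true if the write throws IndexOutOfBoundsException.
     */
    static boolean truncatingWriteFails(int sizeLimit, int written, int size) {
        byte[] buf = new byte[size];
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try {
            if (written + size > sizeLimit) {
                // With sizeLimit = -1 the length argument is negative here.
                out.write(buf, 0, sizeLimit - written);
            } else {
                out.write(buf, 0, size);
            }
            return false;
        } catch (IndexOutOfBoundsException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println(truncatingWriteFails(-1, 0, 1024));    // prints "true"
        System.out.println(truncatingWriteFails(65536, 0, 1024)); // prints "false"
    }
}
```

With a limit of -1 the very first chunk is already "over the limit", so nothing is ever written, which would match the "expanded to 0 bytes" line in the log.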
I hope this helps.
Regards,
Sébastien.
Re: GZipped pages
Posted by Sébastien LE CALLONNEC <sl...@yahoo.ie>.
I realise I forgot to include the method code I wrote to make it work
(even though I think people on this list will find a better solution):
  public static final byte[] unzipBestEffort(byte[] in, int sizeLimit)
  {
    try {
      // decompress using GZIPInputStream
      ByteArrayOutputStream outStream =
        new ByteArrayOutputStream(EXPECTED_COMPRESSION_RATIO * in.length);
      GZIPInputStream inStream =
        new GZIPInputStream(new ByteArrayInputStream(in));

      // a negative sizeLimit (e.g. http.content.limit = -1) would make
      // (limit - written) negative below, so fall back to an estimate
      int limit = sizeLimit;
      if (sizeLimit < 0) {
        limit = EXPECTED_COMPRESSION_RATIO * in.length;
      }

      byte[] buf = new byte[BUF_SIZE];
      int written = 0;
      while (true) {
        try {
          int size = inStream.read(buf);
          if (size <= 0)
            break;
          if ((written + size) > limit) {
            // truncate at the limit
            outStream.write(buf, 0, limit - written);
            break;
          }
          outStream.write(buf, 0, size);
          written += size;
        } catch (Exception e) {
          // best effort: keep whatever was decompressed so far
          break;
        }
      }
      try {
        outStream.close();
      } catch (IOException e) {
        // ignored
      }
      return outStream.toByteArray();
    } catch (IOException e) {
      // the input could not be opened as a gzip stream
      return null;
    }
  }
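
One possible variant, offered only as a sketch: the method above caps the output at EXPECTED_COMPRESSION_RATIO * in.length when the limit is negative, so a page that compresses better than that ratio would be cut short. The version below instead treats a negative sizeLimit as "no limit". The class name, the gzip() helper and the BUF_SIZE value are assumptions made for the example and are not taken from the Nutch sources.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

final class UnlimitedGunzipSketch {

    private static final int BUF_SIZE = 4096; // assumed value

    /** Best-effort gunzip; a negative sizeLimit means "do not truncate". */
    static byte[] unzipBestEffort(byte[] in, int sizeLimit) {
        GZIPInputStream inStream;
        try {
            inStream = new GZIPInputStream(new ByteArrayInputStream(in));
        } catch (IOException e) {
            return null; // not a readable gzip stream
        }
        ByteArrayOutputStream outStream = new ByteArrayOutputStream();
        byte[] buf = new byte[BUF_SIZE];
        int written = 0;
        try {
            while (true) {
                int size = inStream.read(buf);
                if (size <= 0) {
                    break;
                }
                // Only truncate when a non-negative limit was given.
                if (sizeLimit >= 0 && written + size > sizeLimit) {
                    outStream.write(buf, 0, sizeLimit - written);
                    break;
                }
                outStream.write(buf, 0, size);
                written += size;
            }
        } catch (IOException e) {
            // best effort: return whatever was decompressed so far
        }
        return outStream.toByteArray();
    }

    /** Helper for the example only: gzip a byte array in memory. */
    static byte[] gzip(byte[] data) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(bytes)) {
            gz.write(data);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) {
        byte[] packed = gzip(new byte[100000]);
        System.out.println(unzipBestEffort(packed, -1).length); // prints "100000"
        System.out.println(unzipBestEffort(packed, 10).length); // prints "10"
    }
}
```

The trade-off is that with no limit a very large page is held in memory in full, which is presumably what setting http.content.limit to -1 asks for anyway.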
Regards,
Sébastien.