Posted to user@nutch.apache.org by Sébastien LE CALLONNEC <sl...@yahoo.ie> on 2005/03/03 21:59:22 UTC

GZipped pages

Hi list, 

(Sorry if this isn't the right list for this.)

I was experimenting with Nutch (the version I checked out of
Subversion this morning) and it failed to index a site I tried:
apparently the page could not be un-gzipped, for some reason.  The log
message I got was:

050303 211020 fetched 4471 bytes of compressed content (expanded to 0 bytes) from http://www.lesauna.net/index.php3

I played with GZIPUtils a little and noticed that the
unzipBestEffort(byte[] in, int sizeLimit) method has an empty catch
block.  When I added some logging to it, I got the following
exception:

java.lang.IndexOutOfBoundsException
        at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:89)
        at org.apache.nutch.util.GZIPUtils.unzipBestEffort(GZIPUtils.java:70)
        at org.apache.nutch.protocol.http.HttpResponse.<init>(HttpResponse.java:166)
        at org.apache.nutch.protocol.http.Http.getContent(Http.java:186)
        at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:120)

I think the problem occurs when:

1. the page is gzipped, and
2. http.content.limit is set to -1.

In that case (sizeLimit - written) is negative in this method, and
the write to the output stream throws the exception.
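To illustrate the point, here is a minimal sketch of the failing call in isolation. The class and method names are hypothetical and not part of Nutch; I am also assuming that the truncation branch of the original method passes (sizeLimit - written) straight to ByteArrayOutputStream.write as the length, which is what the stack trace suggests.

```java
import java.io.ByteArrayOutputStream;

// Hypothetical demo class, not part of Nutch.
class NegativeLimitDemo {

    // Mimics the assumed truncation branch: write only the bytes that
    // still fit under sizeLimit, and report what happened.
    static String writeRemainder(int sizeLimit, int written) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[16];
        try {
            // A negative length is rejected by ByteArrayOutputStream.write.
            out.write(buf, 0, sizeLimit - written);
            return "wrote " + out.size() + " bytes";
        } catch (IndexOutOfBoundsException e) {
            return "IndexOutOfBoundsException";
        }
    }

    public static void main(String[] args) {
        System.out.println(writeRemainder(10, 4));  // positive remainder
        System.out.println(writeRemainder(-1, 0));  // http.content.limit = -1
    }
}
```

With a non-negative limit the remainder is written normally; with -1 the length is negative and the write is rejected, which the empty catch block then hides.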


I hope this helps.

Regards, 
Sébastien.

Re: GZipped pages

Posted by Sébastien LE CALLONNEC <sl...@yahoo.ie>.
I realise I forgot to include the version of the method I wrote to make
it work (though I expect people here will find a better solution):

  public static final byte[] unzipBestEffort(byte[] in, int sizeLimit) {
    try {
      // decompress using GZIPInputStream
      ByteArrayOutputStream outStream =
        new ByteArrayOutputStream(EXPECTED_COMPRESSION_RATIO * in.length);

      GZIPInputStream inStream = 
        new GZIPInputStream ( new ByteArrayInputStream(in) );
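      // sizeLimit is negative when http.content.limit is -1; fall back to
      // the estimated expanded size so (limit - written) is never negative
      int limit = sizeLimit;
      if (sizeLimit < 0) {
        limit = EXPECTED_COMPRESSION_RATIO * in.length;
      }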

      byte[] buf = new byte[BUF_SIZE];
      int written = 0;
      while (true) {
        try {
          int size = inStream.read(buf);
          
          if (size <= 0) 
            break;
          if ((written + size) > limit) {
            outStream.write(buf, 0, limit - written);
            break;
          }
          outStream.write(buf, 0, size);
          written += size;
        } catch (Exception e) {
          // best effort: keep whatever was decompressed so far
          break;
        }
      }
      try {
        outStream.close();
      } catch (IOException e) {
        // ignored: closing a ByteArrayOutputStream has no effect
      }

      return outStream.toByteArray();

    } catch (IOException e) {
      return null;
    }
  }
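One possible alternative, as a rough sketch only: rather than capping the output at the estimated size, a negative sizeLimit could be treated as "no limit". The class name, the BUF_SIZE value and the gzip helper below are assumptions made for the example and are not taken from GZIPUtils.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical stand-alone class, not the real GZIPUtils.
class UnzipSketch {
    private static final int BUF_SIZE = 4096; // assumed value

    // Returns as many decompressed bytes as possible, at most sizeLimit
    // of them; a negative sizeLimit is treated as "no limit".
    static byte[] unzipBestEffort(byte[] in, int sizeLimit) {
        int limit = (sizeLimit < 0) ? Integer.MAX_VALUE : sizeLimit;
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPInputStream gz =
                 new GZIPInputStream(new ByteArrayInputStream(in))) {
            byte[] buf = new byte[BUF_SIZE];
            int written = 0;
            while (written < limit) {
                int size;
                try {
                    // never ask for more than what still fits under the limit
                    size = gz.read(buf, 0, Math.min(buf.length, limit - written));
                } catch (IOException e) {
                    break; // best effort: keep what was decoded so far
                }
                if (size <= 0) {
                    break; // end of stream
                }
                out.write(buf, 0, size);
                written += size;
            }
            return out.toByteArray();
        } catch (IOException e) {
            return null; // the gzip header could not be read
        }
    }

    // Helper for the example only: gzip a byte array in memory.
    static byte[] gzip(byte[] data) {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(bos)) {
            gz.write(data);
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
        return bos.toByteArray();
    }

    public static void main(String[] args) {
        byte[] packed = gzip("hello gzipped world".getBytes(StandardCharsets.UTF_8));
        System.out.println(unzipBestEffort(packed, -1).length); // no limit
        System.out.println(unzipBestEffort(packed, 5).length);  // truncated
    }
}
```

Bounding each read by the remaining allowance means the length passed to write is never negative, and nothing is cut off when no limit is configured. The trade-off is that an unlimited read can use as much memory as the page expands to.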

Regards, 

Sébastien.