Posted to user@nutch.apache.org by Sean Dean <se...@link.enhancededge.com> on 2005/05/13 08:22:47 UTC
Corrupt GZIP trailer
Hello Everyone,
I'm currently unable to create an index containing more than about 20,000
records.
I get the following error output:
link# bin/nutch index segments/20050511025841
expr: syntax error
050513 021102 parsing file:/usr/local/nutch/conf/nutch-default.xml
050513 021102 parsing file:/usr/local/nutch/conf/nutch-site.xml
050513 021102 No FS indicated, using default:local
050513 021102 indexing segment: segments/20050511025841
050513 021103 * Opening segment 20050511025841
050513 021103 * Indexing segment 20050511025841
050513 021103 Plugins: looking in: /usr/local/nutch/plugins
050513 021103 parsing: /usr/local/nutch/plugins/query-site/plugin.xml
050513 021103 not including: /usr/local/nutch/plugins/parse-ext
050513 021103 not including: /usr/local/nutch/plugins/ontology
050513 021103 parsing: /usr/local/nutch/plugins/protocol-http/plugin.xml
050513 021103 not including: /usr/local/nutch/plugins/parse-pdf
050513 021103 parsing: /usr/local/nutch/plugins/index-basic/plugin.xml
050513 021103 parsing: /usr/local/nutch/plugins/parse-text/plugin.xml
050513 021103 parsing: /usr/local/nutch/plugins/query-url/plugin.xml
050513 021103 not including: /usr/local/nutch/plugins/clustering-carrot2
050513 021103 not including: /usr/local/nutch/plugins/parse-msword
050513 021103 not including: /usr/local/nutch/plugins/query-more
050513 021103 parsing: /usr/local/nutch/plugins/urlfilter-regex/plugin.xml
050513 021104 not including: /usr/local/nutch/plugins/urlfilter-prefix
050513 021104 not including: /usr/local/nutch/plugins/creativecommons
050513 021104 parsing: /usr/local/nutch/plugins/query-basic/plugin.xml
050513 021104 not including: /usr/local/nutch/plugins/language-identifier
050513 021104 parsing: /usr/local/nutch/plugins/parse-html/plugin.xml
050513 021104 found resource common-terms.utf8 at file:/usr/local/nutch/conf/common-terms.utf8
050513 021444 Processed 20000 records (90.39344 rec/s)
Exception in thread "main" java.io.IOException: Corrupt GZIP trailer
at java.util.zip.GZIPInputStream.readTrailer(GZIPInputStream.java:174)
at java.util.zip.GZIPInputStream.read(GZIPInputStream.java:89)
at org.apache.nutch.io.WritableUtils.readCompressedByteArray(WritableUtils.java:34)
at org.apache.nutch.io.WritableUtils.readCompressedString(WritableUtils.java:64)
at org.apache.nutch.parse.ParseText.readFields(ParseText.java:43)
at org.apache.nutch.io.SequenceFile$Reader.next(SequenceFile.java:278)
at org.apache.nutch.io.MapFile$Reader.next(MapFile.java:335)
at org.apache.nutch.io.ArrayFile$Reader.next(ArrayFile.java:61)
at org.apache.nutch.segment.SegmentReader.next(SegmentReader.java:333)
at org.apache.nutch.indexer.IndexSegment.indexPages(IndexSegment.java:130)
at org.apache.nutch.indexer.IndexSegment.main(IndexSegment.java:254)
I have read up on this issue and even found a suggested fix
(http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=4262583), but it didn't work
for me. Has anyone else come across this problem and found a solution?
I have tried running it on both FreeBSD 5.4 with JDK 1.4.2 and Windows (Cygwin)
with JDK 1.4.2 and 1.5, and I get the same error shown above in every case.
Thanks,
Sean