Posted to dev@nutch.apache.org by "Lewis John McGibbney (JIRA)" <ji...@apache.org> on 2014/02/06 20:52:11 UTC

[jira] [Comment Edited] (NUTCH-1723) nutch updatedb fails due to avro (de)serialization issues on images

    [ https://issues.apache.org/jira/browse/NUTCH-1723?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13893718#comment-13893718 ] 

Lewis John McGibbney edited comment on NUTCH-1723 at 2/6/14 7:52 PM:
---------------------------------------------------------------------

Hi [~ksmets], this is a good catch, but a pretty nasty one to deal with.
We are currently working on a GORA_94 branch, which brings an Avro upgrade (1.3.3 --> 1.7.X) and a new persistency API... I am trying to focus my time on more pressing issues such as this one, so I'm personally not going to try to get a fix in right now.
If this issue is still present when we release Gora 0.3, then I'll look into it in detail.
Thanks for logging this bug... it is a PITA indeed! 

Are you able to resume crawls or does the job task fail entirely?



> nutch updatedb fails due to avro (de)serialization issues on images
> -------------------------------------------------------------------
>
>                 Key: NUTCH-1723
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1723
>             Project: Nutch
>          Issue Type: Bug
>          Components: crawldb, parser
>    Affects Versions: 2.3, 2.2.1
>         Environment: - Ubuntu 12.04.3 LTS (GNU/Linux 3.2.0-36-generic x86_64)
> - DataStax Community Edition Apache Cassandra 2.0.4
>            Reporter: Koen Smets
>              Labels: avro, cassandra, gora, gora-cassandra, nutch, tika
>             Fix For: 2.3
>
>
> Running `bin/crawl` for 2 iterations, using either the nutch-2.2.1 release or the latest 2.x checkout, on a seed file containing for example http://www.mountsinai.on.ca and http://www.dhzb.de (or any other webpage with image files that have no obvious file extension) causes java.lang.IllegalArgumentException, IOException and/or IndexOutOfBoundsException to be thrown in the readFields function of WebPageWritable:
>   @Override
>   public void readFields(DataInput in) throws IOException {
>     webPage = IOUtils.deserialize(getConf(), in, webPage, WebPage.class);
>   }
>   @Override
>   public void write(DataOutput out) throws IOException {
>     IOUtils.serialize(getConf(), out, webPage, WebPage.class);
>   }
> 2014-02-04 13:50:15,421 INFO  util.WebPageWritable - Try reading fields: ...
> 2014-02-04 13:50:15,423 ERROR util.WebPageWritable - Error - Failed to read fields: http://www.mountsinai.on.ca/carousel/patient-care-banner/image
> 2014-02-04 13:50:15,423 ERROR util.WebPageWritable - Error - Reading fields of the WebPage class failed - java.lang.IllegalArgumentException
> 2014-02-04 13:50:15,425 ERROR util.WebPageWritable - Error - Printing stacktrace - java.lang.IllegalArgumentException
> Or, 
> java.lang.IndexOutOfBoundsException
>         at java.nio.Buffer.checkBounds(Buffer.java:559)
>         at java.nio.HeapByteBuffer.get(HeapByteBuffer.java:143)
>         at org.apache.avro.ipc.ByteBufferInputStream.read(ByteBufferInputStream.java:52)
>         at org.apache.avro.io.DirectBinaryDecoder.doReadBytes(DirectBinaryDecoder.java:183)
>         at org.apache.avro.io.BinaryDecoder.readString(BinaryDecoder.java:265)
>         at org.apache.gora.mapreduce.FakeResolvingDecoder.readString(FakeResolvingDecoder.java:131)
>         at org.apache.avro.generic.GenericDatumReader.readString(GenericDatumReader.java:280)
>         at org.apache.avro.generic.GenericDatumReader.readMap(GenericDatumReader.java:191)
>         at org.apache.gora.avro.PersistentDatumReader.readMap(PersistentDatumReader.java:183)
>         at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:83)
>         at org.apache.gora.avro.PersistentDatumReader.readRecord(PersistentDatumReader.java:139)
>         at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:80)
>         at org.apache.gora.avro.PersistentDatumReader.read(PersistentDatumReader.java:103)
>         at org.apache.gora.avro.PersistentDatumReader.read(PersistentDatumReader.java:98)
>         at org.apache.gora.mapreduce.PersistentDeserializer.deserialize(PersistentDeserializer.java:73)
>         at org.apache.gora.mapreduce.PersistentDeserializer.deserialize(PersistentDeserializer.java:36)
>         at org.apache.gora.util.IOUtils.deserialize(IOUtils.java:205)
>         at org.apache.nutch.util.WebPageWritable.readFields(WebPageWritable.java:45)
>         at org.apache.nutch.util.GenericWritableConfigurable.readFields(GenericWritableConfigurable.java:54)
>         at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:67)
>         at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:40)
>         at org.apache.hadoop.mapreduce.ReduceContext.nextKeyValue(ReduceContext.java:117)
>         at org.apache.hadoop.mapreduce.ReduceContext.nextKey(ReduceContext.java:92)
>         at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:176)
>         at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:649)
>         at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:418)
>         at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:398)
> The exceptions are caused by image files that sneak through the urlfilter (no extension indicating an image file) and that get (properly?) parsed by the Tika library.
> Note that silently catching the thrown exceptions corrupts the Cassandra database, because the deserializer reads across multiple webpage entries in the DataInput. This results in the loss of several pages from other hosts present in the seed file.
> Moreover, if one makes sure that the image pages don't end up in the DataInput written by DBUpdateMapper, e.g. by configuring nutch-site.xml to disable the Tika parser, the nutch dbupdate finishes properly.
> <property>
>   <name>plugin.excludes</name>
>   <value>parse-tika</value>
> </property>
> I strongly suspect that the issues are due to Gora's dependency on the outdated avro-1.3.3 library.
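
The observation above, that swallowing the exception makes the deserializer read across record boundaries, can be illustrated with a toy stream. The sketch below is only an illustration under stated assumptions: the length-prefixed record format, the 0xFF "undecodable" marker and the class and method names are all hypothetical, and none of it is the Gora/Avro wire format or a proposed fix. It shows that once a reader fails partway through one record and simply carries on, the following reads start at the wrong offset and otherwise healthy records are lost.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class StreamMisalignmentDemo {

    /** Builds a toy stream: each record is a 4-byte length followed by its payload. */
    static byte[] sampleStream() {
        // Hypothetical stand-in for a record the decoder cannot handle (e.g. an image).
        byte[] image = {(byte) 0xFF, 'x', 'x', 'x', 'x', 'x'};
        byte[][] payloads = {
            "page-a".getBytes(StandardCharsets.UTF_8),
            image,
            "page-b".getBytes(StandardCharsets.UTF_8),
            "page-c".getBytes(StandardCharsets.UTF_8)
        };
        try {
            ByteArrayOutputStream bos = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(bos);
            for (byte[] p : payloads) {
                out.writeInt(p.length);
                out.write(p);
            }
            return bos.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Decodes one record straight from the stream and fails partway through a bad one. */
    static String readStreaming(DataInputStream in) throws IOException {
        int len = in.readInt();
        if (len < 1 || len > in.available()) {
            throw new IllegalArgumentException("implausible length " + len);
        }
        byte[] buf = new byte[len];
        buf[0] = in.readByte();
        if (buf[0] == (byte) 0xFF) {
            // The rest of this record stays unread, so the stream is now mid-record.
            throw new IllegalArgumentException("undecodable payload");
        }
        in.readFully(buf, 1, len - 1);
        return new String(buf, StandardCharsets.UTF_8);
    }

    /** Swallows decode errors and keeps reading from wherever the stream happens to be. */
    static List<String> readSwallowing(byte[] data) {
        DataInputStream in = new DataInputStream(new ByteArrayInputStream(data));
        List<String> pages = new ArrayList<>();
        try {
            while (in.available() > 0) {
                try {
                    pages.add(readStreaming(in));
                } catch (IllegalArgumentException | EOFException e) {
                    // Silently ignored: the next read starts at a misaligned offset.
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return pages;
    }

    /** Consumes each whole record before decoding, so a bad record costs only itself. */
    static List<String> readFramed(byte[] data) {
        DataInputStream in = new DataInputStream(new ByteArrayInputStream(data));
        List<String> pages = new ArrayList<>();
        try {
            while (in.available() > 0) {
                byte[] buf = new byte[in.readInt()];
                in.readFully(buf);
                if (buf.length > 0 && buf[0] == (byte) 0xFF) {
                    continue; // skip exactly one record; the stream stays aligned
                }
                pages.add(new String(buf, StandardCharsets.UTF_8));
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return pages;
    }

    public static void main(String[] args) {
        byte[] data = sampleStream();
        System.out.println("swallowing reader: " + readSwallowing(data));
        System.out.println("framed reader:     " + readFramed(data));
    }
}
```

In this toy setup the swallowing reader recovers only the record written before the bad one, while the framed reader also recovers the two records written after it. Whether the real WebPage stream can be re-synchronised in a comparable way is not something this sketch establishes.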

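As a small illustration of why such image URLs get past an extension-based URL filter, here is a minimal sketch. The suffix pattern is a simplified, hypothetical rule written for this example (it is not copied from Nutch's regex-urlfilter.txt), and the class and method names are likewise assumptions. The second URL is the one from the log output in the report.

```java
import java.util.regex.Pattern;

public class SuffixFilterDemo {

    // Hypothetical, simplified suffix rule: reject URLs ending in a common image extension.
    static final Pattern IMAGE_SUFFIX =
        Pattern.compile("\\.(gif|jpe?g|png|bmp|ico)$", Pattern.CASE_INSENSITIVE);

    /** Returns true when the URL passes the filter, i.e. no image suffix is found. */
    static boolean accepts(String url) {
        return !IMAGE_SUFFIX.matcher(url).find();
    }

    public static void main(String[] args) {
        // Rejected: the extension gives the image away.
        System.out.println(accepts("http://www.example.org/banner.png"));
        // Accepted: the URL serves an image but carries no extension.
        System.out.println(accepts("http://www.mountsinai.on.ca/carousel/patient-care-banner/image"));
    }
}
```

A filter of this kind only sees the URL string, so it cannot tell that the second URL returns image content; that is only known once the response has been fetched and its content type detected.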


--
This message was sent by Atlassian JIRA
(v6.1.5#6160)