Posted to user@nutch.apache.org by Azhar Jassal <az...@gmail.com> on 2014/09/10 17:03:29 UTC

Can't run Mappers on HBase 0.94 / Nutch 2.3-SNAPSHOT

Hi

I am in the process of upgrading from Nutch 2.2.1 to Nutch 2.3-SNAPSHOT.

I have upgraded HBase from 0.90.4 to 0.94.13 and can scan all of the
pre-existing tables through the HBase shell. If I inject new URLs into a new
crawl table, everything works fine. However, when running a job (e.g.
FetcherJob) against the pre-existing tables, I encounter the following
exception coming from GoraRecordReader, which prevents FetcherMapper
from running:

java.io.EOFException
        at org.apache.avro.io.BinaryDecoder.ensureBounds(BinaryDecoder.java:473)
        at org.apache.avro.io.BinaryDecoder.readInt(BinaryDecoder.java:128)
        at org.apache.avro.io.ValidatingDecoder.readInt(ValidatingDecoder.java:83)
        at org.apache.avro.generic.GenericDatumReader.readInt(GenericDatumReader.java:376)
        at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:156)
        at org.apache.avro.generic.GenericDatumReader.readRecord(GenericDatumReader.java:177)
        at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:148)
        at org.apache.avro.generic.GenericDatumReader.read(GenericDatumReader.java:139)
        at org.apache.gora.hbase.util.HBaseByteInterface.fromBytes(HBaseByteInterface.java:145)
        at org.apache.gora.hbase.util.HBaseByteInterface.fromBytes(HBaseByteInterface.java:114)
        at org.apache.gora.hbase.store.HBaseStore.setField(HBaseStore.java:713)
        at org.apache.gora.hbase.store.HBaseStore.setField(HBaseStore.java:679)
        at org.apache.gora.hbase.store.HBaseStore.setField(HBaseStore.java:644)
        at org.apache.gora.hbase.store.HBaseStore.newInstance(HBaseStore.java:625)
        at org.apache.gora.hbase.query.HBaseResult.readNext(HBaseResult.java:48)
        at org.apache.gora.hbase.query.HBaseScannerResult.nextInner(HBaseScannerResult.java:54)
        at org.apache.gora.query.impl.ResultBase.next(ResultBase.java:114)
        at org.apache.gora.mapreduce.GoraRecordReader.nextKeyValue(GoraRecordReader.java:119)
        at org.apache.hadoop.mapred.MapTask$NewTrackingRecordReader.nextKeyValue(MapTask.java:531)
        at org.apache.hadoop.mapreduce.MapContext.nextKeyValue(MapContext.java:67)
        at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
        at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:364)
        at org.apache.hadoop.mapred.LocalJobRunner$Job$MapTaskRunnable.run(LocalJobRunner.java:223)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
        at java.util.concurrent.FutureTask.run(FutureTask.java:262)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)

As I said, working against a new table is fine; it's only the existing
data (crawlIds) that fails. There seems to be something that Avro doesn't
like about the data. HBase itself seems to be fine, as I can scan the tables
and read data directly.
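
To make the failure mode in the trace concrete: Avro's binary encoding stores an int as a variable-length zig-zag value, so a reader that expects an int but is handed an empty or truncated cell runs out of bytes. The sketch below is a hand-rolled stand-in, not Avro's BinaryDecoder; the class and method names (VarIntSketch, readInt, decode) are hypothetical. Whether the pre-existing cells really are empty, truncated, or written in a different format is an assumption I have not verified.

```java
import java.io.ByteArrayInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;

/** Illustrative zig-zag varint reader; hypothetical, not Avro's BinaryDecoder. */
class VarIntSketch {

    /** Reads one zig-zag varint-encoded int, failing with EOFException if bytes run out. */
    static int readInt(InputStream in) throws IOException {
        int result = 0;
        for (int shift = 0; shift < 35; shift += 7) {
            int b = in.read();
            if (b < 0) {
                // The stored value ended before the int was complete.
                throw new EOFException("ran out of bytes while reading an int");
            }
            result |= (b & 0x7F) << shift;
            if ((b & 0x80) == 0) {
                // Undo zig-zag encoding: 0 -> 0, 1 -> -1, 2 -> 1, ...
                return (result >>> 1) ^ -(result & 1);
            }
        }
        throw new IOException("varint longer than 5 bytes");
    }

    /** Decodes a single int from the raw bytes of one (hypothetical) stored cell. */
    static int decode(byte[] cell) throws IOException {
        return readInt(new ByteArrayInputStream(cell));
    }
}
```

A well-formed cell decodes normally, while an empty cell, or one whose last byte still has the continuation bit set, ends in an EOFException much like the one above.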

Any ideas?


Az

Re: Can't run Mappers on HBase 0.94 / Nutch 2.3-SNAPSHOT

Posted by Azhar Jassal <az...@gmail.com>.

Just some additional info:

In order to output this exception I had to hack my copy of Gora 0.4.

File: org/apache/gora/mapreduce/GoraRecordReader.java

Otherwise, as you can see, the exception is caught and suppressed. I had
to print it out, as otherwise the Mapper fails silently.

Have I missed a required step while upgrading to Nutch 2.3 / Gora 0.4 / HBase
0.94.13 that migrates the existing data in some way?

Code of GoraRecordReader that's causing the silent Mapper failures:

  @Override
  public boolean nextKeyValue() throws IOException, InterruptedException {
    try {
      if (counter.isModulo()) {
        boolean firstBatch = (this.result == null);
        if (!firstBatch) {
          this.query.setStartKey(this.result.getKey());
          if (this.query.getLimit() == counter.getRecordsMax()) {
            this.query.setLimit(counter.getRecordsMax() + 1);
          }
        }
        if (this.result != null) {
          this.result.close();
        }

        executeQuery();

        if (!firstBatch) {
          // skip first result
          this.result.next();
        }
      }

      counter.increment();
      return this.result.next();
    } catch (Exception e) {
      return false;
    }
  }
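
For comparison, here is a minimal sketch of the same "advance to the next record" step with the failure surfaced rather than turned into "no more records". RecordSource and LoudRecordReader are hypothetical stand-ins I made up for illustration; they are not Gora classes, and this is not a patch against Gora 0.4.

```java
import java.io.IOException;

/** Hypothetical stand-in for a result cursor: advances to the next record. */
interface RecordSource {
    boolean next() throws Exception;
}

/** Sketch of a reader that reports read failures instead of returning false. */
class LoudRecordReader {
    private final RecordSource source;

    LoudRecordReader(RecordSource source) {
        this.source = source;
    }

    /** Returns true while records remain; any failure reaches the caller as an IOException. */
    public boolean nextKeyValue() throws IOException {
        try {
            return source.next();
        } catch (IOException e) {
            // Already the declared type (EOFException included): pass it through.
            throw e;
        } catch (Exception e) {
            // Wrap anything else so the original stack trace is kept as the cause.
            throw new IOException("Failed to read next record", e);
        }
    }
}
```

With this shape a decoding error would fail the task with a visible stack trace, instead of looking like an empty input split.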



