Posted to dev@nutch.apache.org by kaveh minooie <ka...@plutoz.com> on 2013/04/01 23:45:22 UTC

Re: error using generate in 2.x

Hi

First of all, I am posting this to both the user and dev lists, since this is
becoming a dev issue more than anything else and it seems to me that it
needs to be moved to the dev list. Let me know if I am wrong, because I
don't want to generate any more spam than we are already getting on those
lists.

The patch NUTCH-1551 didn't solve my issue. I am still getting exactly the
same error when I try to run generate (this was run in local mode):

2013-04-01 11:43:27,710 INFO  store.HBaseStore - Keyclass and nameclass match but mismatching table names  mappingfile schema is 'webpage' vs actual schema 't1_webpage' , assuming they are the same.
2013-04-01 11:43:27,718 INFO  mapreduce.GoraRecordWriter - gora.buffer.write.limit = 10000
2013-04-01 11:43:27,838 WARN  mapred.FileOutputCommitter - Output path is null in cleanup
2013-04-01 11:43:27,839 WARN  mapred.LocalJobRunner - job_local_0001
java.lang.NullPointerException
         at org.apache.gora.hbase.store.HBaseStore.put(HBaseStore.java:235)
         at org.apache.gora.mapreduce.GoraRecordWriter.write(GoraRecordWriter.java:60)
         at org.apache.hadoop.mapred.ReduceTask$NewTrackingRecordWriter.write(ReduceTask.java:588)
         at org.apache.hadoop.mapreduce.TaskInputOutputContext.write(TaskInputOutputContext.java:80)
         at org.apache.nutch.crawl.GeneratorReducer.reduce(GeneratorReducer.java:79)
         at org.apache.nutch.crawl.GeneratorReducer.reduce(GeneratorReducer.java:40)
         at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:176)
         at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:650)
         at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:418)
         at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:260)
2013-04-01 11:43:28,763 ERROR crawl.GeneratorJob - GeneratorJob: java.lang.RuntimeException: job failed: name=[t1]generate: 1364841802-1763249246, jobid=job_local_0001
         at org.apache.nutch.util.NutchJob.waitForCompletion(NutchJob.java:54)
         at org.apache.nutch.crawl.GeneratorJob.run(GeneratorJob.java:193)
         at org.apache.nutch.crawl.GeneratorJob.generate(GeneratorJob.java:219)
         at org.apache.nutch.crawl.GeneratorJob.run(GeneratorJob.java:264)
         at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
         at org.apache.nutch.crawl.GeneratorJob.main(GeneratorJob.java:272)



I did a little bit of tracing, and now I am no longer sure whether this is a
Nutch issue or a Gora issue, because:

the original error (NPE) comes from here (in
gora/blob/trunk/gora-hbase/src/main/java/org/apache/gora/hbase/store/HBaseStore.java:235):


   case MAP:
             if(o instanceof StatefulMap) {
               StatefulHashMap<Utf8, ?> map = (StatefulHashMap<Utf8, ?>) o;
               for (Entry<Utf8, State> e : map.states().entrySet()) {
                 Utf8 mapKey = e.getKey();
                 switch (e.getValue()) {
                   case DIRTY:
--->                byte[] qual = Bytes.toBytes(mapKey.toString());
                     byte[] val = toBytes(map.get(mapKey), 
field.schema().getValueType());
                     put.add(hcol.getFamily(), qual, val);
                     hasPuts = true;
                     break;
                   case DELETED:
                     qual = Bytes.toBytes(mapKey.toString());
                     hasDeletes = true;
                     delete.deleteColumn(hcol.getFamily(), qual);
                     break;
                 }
               }
             } else {

The variable most likely to be null seems to be 'mapKey', probably as a
result of a malformed URL (though I can't say that for sure).
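
To make that suspicion concrete: calling toString() on a null map key throws an NPE, just as the marked line would. What follows is only a minimal sketch and not Gora code. The class NullKeyGuard, the State enum and the method dirtyQualifiers are hypothetical stand-ins for the loop in HBaseStore.put(), and they assume the null really is the key, which I have not confirmed.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for the MAP branch of HBaseStore.put(); not Gora code.
class NullKeyGuard {
  enum State { DIRTY, DELETED }

  // Returns the qualifiers of all DIRTY entries, skipping null keys
  // instead of letting mapKey.toString() throw a NullPointerException.
  static List<byte[]> dirtyQualifiers(Map<String, State> states) {
    List<byte[]> quals = new ArrayList<>();
    for (Map.Entry<String, State> e : states.entrySet()) {
      String mapKey = e.getKey();
      if (mapKey == null) {
        continue; // an unguarded mapKey.toString() would fail here
      }
      if (e.getValue() == State.DIRTY) {
        quals.add(mapKey.getBytes(StandardCharsets.UTF_8));
      }
    }
    return quals;
  }

  public static void main(String[] args) {
    Map<String, State> states = new HashMap<>();
    states.put("bid", State.DIRTY);
    states.put(null, State.DIRTY);   // HashMap permits a single null key
    states.put("old", State.DELETED);
    System.out.println(dirtyQualifiers(states).size());
  }
}
```

Skipping the entry keeps the job alive but drops that value silently, so a real fix would probably also want to log the offending row key.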

The put function is called from here.

This is from Gora 0.2.1:

gora/blob/0.2.1/gora-core/src/main/java/org/apache/gora/mapreduce/GoraRecordWriter.java:


   @Override
   public void write(K key, T value) throws IOException, 
InterruptedException {
     store.put(key, (Persistent) value);

     counter.increment();
     if (counter.isModulo()) {
       LOG.info("Flushing the datastore after " + 
counter.getRecordsNumber() + " records");
       store.flush();
     }
   }
}


the same function in gora trunk is like this:

public void write(K key, T value) throws IOException, InterruptedException {
	  try{
	    store.put(key, (Persistent) value);

	    counter.increment();
	    if (counter.isModulo()) {
	      LOG.info("Flushing the datastore after " + 
counter.getRecordsNumber() + " records");
	      store.flush();
	    }
	  }catch(Exception e){
		  LOG.info("Exception at GoraRecordWriter.class while writing to 
datastore." + e.getMessage());
	  }
   }


It seems to me that this would allow the code to recover from this kind of
error. I get Gora through Ivy, and I don't know how, or whether, I can have
Ivy fetch the trunk. Regardless, I still think the question remains:
is this a Nutch issue or a Gora issue?
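
The practical difference between the two write() variants can be shown in miniature. This is a sketch under stated assumptions: WriterSketch, its Store interface and both method names are hypothetical, and the "log" is a plain list rather than a real logger. It only illustrates that the 0.2.1-style call lets the exception fail the job, while the trunk-style call records it and moves on.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical miniature of the two GoraRecordWriter.write() variants.
class WriterSketch {
  // Stand-in for a data store whose put() may fail on a bad record.
  interface Store {
    void put(String key, String value);
  }

  // 0.2.1-style: any exception propagates and fails the job.
  static void writeStrict(Store store, String key, String value) {
    store.put(key, value);
  }

  // trunk-style: the exception is caught and recorded, and the job continues.
  // Returns true only if the record was actually written.
  static boolean writeLenient(Store store, String key, String value, List<String> log) {
    try {
      store.put(key, value);
      return true;
    } catch (Exception e) {
      log.add("Exception while writing to datastore: " + e.getMessage());
      return false;
    }
  }

  public static void main(String[] args) {
    Store failing = (k, v) -> {
      if (v == null) throw new NullPointerException("null value for " + k);
    };
    List<String> log = new ArrayList<>();
    System.out.println(writeLenient(failing, "url1", "page", log));
    System.out.println(writeLenient(failing, "url2", null, log));
    System.out.println(log);
  }
}
```

Note that the lenient variant does not write the failing record at all, so a job that "finishes without exceptions" may still have stored nothing; the log has to be checked.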


sorry for the long email.


On 03/30/2013 04:03 PM, Lewis John Mcgibbney wrote:
> I think we may also need to add the BATCH_ID to one Job's HashSet
>
> private static final Collection<WebPage.Field> FIELDS = new
> HashSet<WebPage.Field>();
> static {
> ...
>    FIELDS.add(WebPage.Field.BATCH_ID);
> }
>
>
> On Sat, Mar 30, 2013 at 3:55 PM, Lewis John Mcgibbney <
> lewis.mcgibbney@gmail.com> wrote:
>
>> Hi,
>> I've tried to sort this out locally this morning...
>> I can almost replicate this behaviour with gora-cassandra and it looks
>> most likely that the patch(es) applied in
>> * NUTCH-1533 - NUTCH-1532 Implement getPrevModifiedTime(),
>> setPrevModifiedTime(), getBatchId() and setBatchId() accessors in
>> o.a.n.storage.WebPage, and
>> * NUTCH-1532 - Replace 'segment' mapping field with batchId,
>> respectively are not backwards compatible because some URLs within the web
>> database do not contain values to the batchId.
>> Of course this is a major problem.
>> I opened NUTCH-1551 [0] and submitted a patch to make WebTableReader
>> backwards compatible with the above patches. Please try out the patch if
>> you can and comment so I can commit.
>>
>> We have a couple options here.
>> 1) Revert both of the above until we can get a fix
>> 2) Get a fix just now and commit it.
>> What do you guys want to do?
>>
>> I have a question about whether or not we can dynamically add fields to
>> existing database entries by injecting them?
>> Say for example, you inject URLs without the batchId field in your mapping
>> file, then add the field and inject some more URLs... will the field be
>> added to your database? If so then why are we getting the NPE?
>> There must be some other location in the Nutch code where an asserted
>> attempt is being made to obtain the batchId for some given key... it
>> cannot be obtained and we receive the NPE.
>>
>> [0] https://issues.apache.org/jira/browse/NUTCH-1551
>>
>>
>> On Fri, Mar 29, 2013 at 5:05 PM, kaveh minooie <ka...@plutoz.com> wrote:
>>
>>> I use git and I fetch from GitHub (https://github.com/apache/nutch.git). Currently I am on this commit:
>>>
>>> commit 4bb01d6b908dc230c8be89d398b03a86581ec42b
>>> Author: lufeng <fe...@apache.org>
>>> Date:   Thu Mar 28 13:09:09 2013 +0000
>>>
>>>      NUTCH-1547 BasicIndexingFilter - Problem to index full title
>>>
>>>      git-svn-id: https://svn.apache.org/repos/asf/nutch/branches/2.x@1462079 13f79535-47bb-0310-9956-ffa450edef68
>>>
>>>
>>> before I was on this commit :
>>>
>>>
>>> commit f02dcf62566583551426c08bd388080e5b2bc93e
>>>
>>>>   f02dcf6 NUTCH-XX remove unused db.max.inlinks from nutch-default.xml
>>>
>>>
>>> On 03/29/2013 04:35 PM, alxsss@aim.com wrote:
>>>
>>>> Yes, with hbase. Here is the error
>>>>
>>>> 13/03/29 16:33:29 INFO zookeeper.ZooKeeper: Session: 0x13d7770d67d005f closed
>>>> 13/03/29 16:33:29 ERROR crawl.WebTableReader: WebTableReader: java.lang.NullPointerException
>>>>           at org.apache.gora.hbase.store.HBaseStore.addFields(HBaseStore.java:398)
>>>>           at org.apache.gora.hbase.store.HBaseStore.execute(HBaseStore.java:360)
>>>>           at org.apache.nutch.crawl.WebTableReader.read(WebTableReader.java:234)
>>>>           at org.apache.nutch.crawl.WebTableReader.run(WebTableReader.java:476)
>>>>           at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>>>>           at org.apache.nutch.crawl.WebTableReader.main(WebTableReader.java:412)
>>>>           at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>>>           at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>>>>           at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>>>>           at java.lang.reflect.Method.invoke(Method.java:597)
>>>>           at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
>>>>
>>>>
>>>> If I revert to previous release it works fine.
>>>>
>>>> Thanks.
>>>> Alex.
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> -----Original Message-----
>>>> From: Lewis John Mcgibbney <le...@gmail.com>
>>>> To: user <us...@nutch.apache.org>
>>>> Sent: Fri, Mar 29, 2013 4:30 pm
>>>> Subject: Re: error using generate in 2.x
>>>>
>>>>
>>>> Hi Alex,
>>>> With HBase also?
>>>> There 'was' a bug in gora-cassandra module for this command + params
>>>> however I thought it had been addressed and therefore resolved it.
>>>> Lewis
>>>>
>>>>
>>>> On Fri, Mar 29, 2013 at 4:00 PM, <al...@aim.com> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> It seems that trunk has a few bugs. I found out that readdb -url urlname
>>>>> also gives errors.
>>>>>
>>>>> Thanks.
>>>>> Alex.
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> -----Original Message-----
>>>>> From: kaveh minooie <ka...@plutoz.com>
>>>>> To: user <us...@nutch.apache.org>
>>>>> Sent: Fri, Mar 29, 2013 1:53 pm
>>>>> Subject: Re: error using generate in 2.x
>>>>>
>>>>>
>>>>> Hi lewis
>>>>>
>>>>> the mapping file that I am using is the one that comes with nutch, and I
>>>>> haven't touched it. this message in the log is caused by using the
>>>>> -crawlId on the command line. for example this log was the result of
>>>>> this command :
>>>>>
>>>>> bin/nutch generate -topN 1000 -crawlId t1
>>>>>
>>>>> which causes nutch (or I guess technically gora) to use a table
>>>>> named 't1_webpage'. Though I have to say that I don't understand the
>>>>> rationale behind the code generating a warning like this (I mean, I know
>>>>> it is not actually a warning, just that the way the message has been
>>>>> phrased makes it look like a warning) for something that should be a
>>>>> routine operation for someone like me who is crawling (I mean hoping
>>>>> to, because it is not working right now) thousands of websites and
>>>>> maintains multiple crawldbs (or their equivalent in gora, webpage
>>>>> tables) for different groups of websites.
>>>>>
>>>>>
>>>>> Now that being said, it has nothing to do with the problem that I am
>>>>> having. It is the same when I omit the -crawlId parameter (forcing it
>>>>> to use the default name webpage), and more importantly it is new. I
>>>>> haven't had this problem before; it just started happening 2 days ago
>>>>> when I pulled the latest commits to the 2.x branch.
>>>>>
>>>>>
>>>>> On 03/29/2013 09:50 AM, Lewis John Mcgibbney wrote:
>>>>>
>>>>>> Hi Kaveh,
>>>>>> Firstly, as logged below, Gora attempts to associate your HBase table
>>>>>> configuration with specified tables (from within
>>>>>> gora-hbase-mapping.xml)
>>>>>> however it seems that your case satisfies the condition "if
>>>>>> (!tableName.equals(tableNameFromMapping))" meaning that the table
>>>>>> name is not equal to the value for the table name attribute or that
>>>>>> this value is null.
>>>>>> This is allowed, but I am interested to find out what the mapping file
>>>>>> looks like... the entire file is not required, just the <class
>>>>>> name="value" snippet if this is possible.
>>>>>> I am not using the gora-hbase module and haven't ever seen anyone come
>>>>>> across this problem before.
>>>>>> Thanks
>>>>>> Lewis
>>>>>>
>>>>>> On Thursday, March 28, 2013, kaveh minooie <ka...@plutoz.com> wrote:
>>>>>>
>>>>>>> 2013-03-28 11:06:25,158 INFO  store.HBaseStore - Keyclass and nameclass
>>>>>>> match but mismatching table names  mappingfile schema is 'webpage' vs
>>>>>>> actual schema 't1_webpage' , assuming they are the same.
>>>>>>
>>>>>>
>>>>> --
>>>>> Kaveh Minooie
>>>>>
>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>> --
>>> Kaveh Minooie
>>>
>>
>>
>>
>> --
>> *Lewis*
>>
>
>
>

-- 
Kaveh Minooie

Re: error using generate in 2.x

Posted by kaveh minooie <ka...@plutoz.com>.

OK, so I got
gora-core-0.3-20130401.060419-325.jar
gora-hbase-0.3-20130401.065448-305.jar

and when I ran generate, the job finished without any exception, but the
log file was full of lines like this (one for every URL that I had in the
webpage table):

INFO  mapreduce.GoraRecordWriter - Exception at GoraRecordWriter.class 
while writing to datastore.HBase mapping for field 
[org.apache.nutch.storage.WebPage#batchId] not found. Wrong 
gora-hbase-mapping.xml?


When I checked gora-hbase-mapping.xml, there was no field for batchId,

so I copied this line from gora-cassandra-mapping.xml:

<field name="batchId" family="f" qualifier="bid"/>

After that everything (and by that I mean generate, fetch, updatedb)
worked fine. So now here are my questions:

1- As I said, that line is missing from gora-hbase-mapping.xml. Does this
need a JIRA issue, or can you guys just add it and commit without jumping
through all the hoops?

2- Is the trunk version supposed to be compiled against the Gora trunk?
Because the current HEAD is not working with 0.2.1.

P.S. This, by the way, worked the same with and without the NUTCH-1551 patch.
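
For anyone hitting the same log message, one quick way to confirm whether a mapping file declares a given field is to parse it and look for the matching <field> element. This is only a sketch: the class MappingCheck and the method hasField are hypothetical helpers, not part of Nutch or Gora, and the inline XML is a trimmed-down illustration rather than the real gora-hbase-mapping.xml.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper, not part of Nutch or Gora.
class MappingCheck {
  // Returns true if the mapping XML contains a <field> element whose
  // "name" attribute equals fieldName.
  static boolean hasField(String mappingXml, String fieldName) {
    try {
      Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
          .parse(new ByteArrayInputStream(mappingXml.getBytes(StandardCharsets.UTF_8)));
      NodeList fields = doc.getElementsByTagName("field");
      for (int i = 0; i < fields.getLength(); i++) {
        Element field = (Element) fields.item(i);
        if (fieldName.equals(field.getAttribute("name"))) {
          return true;
        }
      }
      return false;
    } catch (Exception e) {
      throw new IllegalArgumentException("could not parse mapping XML", e);
    }
  }

  public static void main(String[] args) {
    // Illustrative mapping only; the real file declares many more fields.
    String xml =
        "<gora-orm>"
        + "<class name=\"org.apache.nutch.storage.WebPage\" table=\"webpage\">"
        + "<field name=\"batchId\" family=\"f\" qualifier=\"bid\"/>"
        + "</class>"
        + "</gora-orm>";
    System.out.println(hasField(xml, "batchId"));
  }
}
```

Running the same check against each gora-*-mapping.xml would show which backends still lack the batchId entry.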



On 04/01/2013 03:28 PM, Lewis John Mcgibbney wrote:
> You're right, this is a dev issue for sure.
>
>
> On Mon, Apr 1, 2013 at 2:45 PM, kaveh minooie <kaveh@plutoz.com> wrote:
>
>     The patch NUTCH-1551 didn't solve my issue. I am still getting the
>     same exact error when i try to run generate. (this was run in local
>     mode) :
>
>
> NUTCH-1551 is not supposed to fix this problem entirely. It merely
> attempts to make the WebTableReader tool backwards compatible and
> permits you to check whether accessor methods WebPage.getBatchId() and
> WebPage.getPrevModifiedTime() actually work for your use case. If you
> are able to check and provide feedback of the webtable dump for the URL
> causing the NPE it would be very valuable indeed.
>
>
>     now the likely variable that is null seems to be 'mapKey' which is
>     probably as a result of malformed URL ( though I can't say that for
>     sure )
>
>     now the put function is being called from here
>
>     this is from gora 0.2.1:
>
>     gora/blob/0.2.1/gora-core/src/main/java/org/apache/gora/mapreduce/GoraRecordWriter.java:
>
>     ...
>
>
>     the same function in gora trunk is like this:
>     ...
>
>     which seems to me that would allow the code to recover from this
>     kind of errors. now I get gora through ivy and I don't know how or
>     if I can have ivy to fetch the trunk but regardless I still think
>     the question remains whether it is a nutch issue or gora?
>
> So it appears that some issues have been addressed and improved within
> Gora trunk (which is nice). You can pull a Gora SNAPSHOT from here [0]
> and place it on your class path then try it out. Feedback would be
> greatly appreciated.
>
> The underlying problem here is that not everyone using and developing
> Gora is using and developing Nutch. We have been making good progress
> towards building diversity over in Gora so that it is not so heavily
> reliant upon Nutch users. This means the project can stand on its own
> two feet. The downside of this, is that *some* bugs arising from *some*
> use cases are not discovered until a little later than we would like.
> Your feedback is really really helpful.
>
> It should be noted that you can also patch your local copy of 2.x HEAD
> to not contain the two offending issues we've previously discussed.
>
> [0]
> https://repository.apache.org/content/repositories/snapshots/org/apache/gora/

-- 
Kaveh Minooie

Re: error using generate in 2.x

Posted by Lewis John Mcgibbney <le...@gmail.com>.

You're right, this is a dev issue for sure.

On Mon, Apr 1, 2013 at 2:45 PM, kaveh minooie <ka...@plutoz.com> wrote:

> The patch NUTCH-1551 didn't solve my issue. I am still getting the same
> exact error when i try to run generate. (this was run in local mode) :
>

NUTCH-1551 is not supposed to fix this problem entirely. It merely attempts
to make the WebTableReader tool backwards compatible and permits you to
check whether accessor methods WebPage.getBatchId() and
WebPage.getPrevModifiedTime() actually work for your use case. If you are
able to check and provide feedback of the webtable dump for the URL causing
the NPE it would be very valuable indeed.

>
> now the likely variable that is null seems to be 'mapKey' which is
> probably as a result of malformed URL ( though I can't say that for sure )
>
> now the put function is being called from here
>
> this is from gora 0.2.1:
>
> gora/blob/0.2.1/gora-core/src/main/java/org/apache/gora/mapreduce/GoraRecordWriter.java:
>
> ...
>
>
> the same function in gora trunk is like this:
> ...
>
> which seems to me that would allow the code to recover from this kind of
> errors. now I get gora through ivy and I don't know how or if I can have
> ivy to fetch the trunk but regardless I still think the question remains
> whether it is a nutch issue or gora?
>

So it appears that some issues have been addressed and improved within
Gora trunk (which is nice). You can pull a Gora SNAPSHOT from here [0] and
place it on your classpath then try it out. Feedback would be greatly
appreciated.

The underlying problem here is that not everyone using and developing Gora
is using and developing Nutch. We have been making good progress towards
building diversity over in Gora so that it is not so heavily reliant upon
Nutch users. This means the project can stand on its own two feet. The
downside of this, is that *some* bugs arising from *some* use cases are not
discovered until a little later than we would like.
Your feedback is really really helpful.

It should be noted that you can also patch your local copy of 2.x HEAD to
not contain the two offending issues we've previously discussed.

[0]
https://repository.apache.org/content/repositories/snapshots/org/apache/gora/