Posted to dev@nutch.apache.org by "Soila Pertet (JIRA)" <ji...@apache.org> on 2010/04/24 08:57:56 UTC

[jira] Commented: (NUTCH-650) Hbase Integration

    [ https://issues.apache.org/jira/browse/NUTCH-650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12860493#action_12860493 ] 

Soila Pertet commented on NUTCH-650:
------------------------------------

I encountered the following NullPointerException while running nutchbase.

2010-04-24 01:58:47,012 WARN org.apache.hadoop.mapred.TaskTracker: Error running child
java.lang.NullPointerException
    at org.apache.hadoop.hbase.io.ImmutableBytesWritable.<init>(ImmutableBytesWritable.java:59)
    at org.apache.nutch.fetcher.Fetcher$FetcherMapper.map(Fetcher.java:81)
    at org.apache.nutch.fetcher.Fetcher$FetcherMapper.map(Fetcher.java:77)
    at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
    at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:621)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305)
    at org.apache.hadoop.mapred.Child.main(Child.java:170)

I checked out nutchbase with svn co http://svn.apache.org/repos/asf/lucene/nutch/branches/nutchbase and applied Xiao's patch. I am running hadoop-0.20.3, hbase-0.20.3 and zookeeper-3.2.2.

In my application, the error occurs after the first iteration of the fetch/generate cycle and is limited to the base URL, which has generator mark=csh, e.g.:
keyvalues={host:http:8080/wikipedia/de/de/index.html/mtdt:_csh_/1272088691273/Put/vlen=4}

It works fine, however, for values with generator mark=genmrk, e.g.:
keyvalues={host:http:8080/wikipedia/de/de/images/wikimedia-button.png/mtdt:__genmrk__/1272088714395/Put/vlen=4, host:http:8080/wikipedia/de/de/images/wikimedia-button.png/mtdt:_csh_/1272088691109/Put/vlen=4}

I modified org.apache.nutch.fetcher.Fetcher$FetcherMapper.map to check whether outKeyRaw is null. This masks the error, but I am not sure it is the right fix. Please let me know.
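
For reference, a minimal sketch of that kind of guard. This is not the actual change to Fetcher.java: BytesKey is a hypothetical stand-in for a key wrapper whose constructor dereferences the byte array (which is what the stack trace suggests happens), and mapRows is a hypothetical substitute for the map step. Only the name outKeyRaw comes from the code discussed above.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class NullKeyGuard {

    /** Hypothetical stand-in for a key wrapper that reads the array length on construction. */
    static final class BytesKey {
        final byte[] bytes;
        final int length;

        BytesKey(byte[] bytes) {
            this.length = bytes.length; // throws NullPointerException when bytes is null
            this.bytes = bytes;
        }
    }

    /** Hypothetical map step: rows whose raw out-key is missing are skipped, not wrapped. */
    static List<BytesKey> mapRows(List<byte[]> rawKeys) {
        List<BytesKey> emitted = new ArrayList<>();
        for (byte[] outKeyRaw : rawKeys) {
            if (outKeyRaw == null) {
                // Guard: avoids the exception but does not explain why the key is missing.
                continue;
            }
            emitted.add(new BytesKey(outKeyRaw));
        }
        return emitted;
    }

    public static void main(String[] args) {
        List<byte[]> rows = new ArrayList<>();
        rows.add("row-a".getBytes(StandardCharsets.UTF_8));
        rows.add(null); // simulates a row with no out-key
        rows.add("row-b".getBytes(StandardCharsets.UTF_8));
        System.out.println("emitted keys: " + mapRows(rows).size());
    }
}
```

The guard only skips the affected row; it does not show why the row has no out-key in the first place, which is the open question here.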

Thanks.

> Hbase Integration
> -----------------
>
>                 Key: NUTCH-650
>                 URL: https://issues.apache.org/jira/browse/NUTCH-650
>             Project: Nutch
>          Issue Type: New Feature
>            Reporter: Doğacan Güney
>            Assignee: Doğacan Güney
>             Fix For: 2.0
>
>         Attachments: hbase-integration_v1.patch, hbase_v2.patch, malformedurl.patch, meta.patch, meta2.patch, nofollow-hbase.patch, NUTCH-650.patch, nutch-habase.patch, searching.diff, slash.patch
>
>
> This issue will track nutch/hbase integration

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.