Posted to dev@nutch.apache.org by Apache Wiki <wi...@apache.org> on 2010/09/08 12:47:18 UTC
[Nutch Wiki] Update of "GORA_HBase" by JulienNioche
The "GORA_HBase" page has been changed by JulienNioche.
http://wiki.apache.org/nutch/GORA_HBase
--------------------------------------------------
New page:
This document describes how to get Nutch 2.0 to use HBase as a backend for GORA. It is based on revision 993857 of the Nutch trunk.
* Install and configure HBase 0.20.6
* Pull the GORA code and compile it
* Copy the jars from gora/gora-hbase/lib-ext to nutch/lib
* Add the following to nutch/ivy/ivy.xml
{{{
<dependency org="org.gora" name="gora-hbase" rev="0.1" conf="*->compile">
  <exclude org="com.sun.jdmk"/>
  <exclude org="com.sun.jmx"/>
  <exclude org="javax.jms"/>
</dependency>
}}}
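The jar-copy step listed above can be scripted so that it fails loudly when the GORA build has not produced any jars. This is a minimal sketch, not part of Nutch or GORA: `copy_gora_jars` is a hypothetical helper name, and the two directories are passed in as arguments rather than assumed.

```shell
#!/bin/sh
# Sketch only: copy_gora_jars is a hypothetical helper, not a Nutch or GORA tool.
# Copies every *.jar from a source directory into a destination directory.
copy_gora_jars() {
    src="$1"    # e.g. gora/gora-hbase/lib-ext
    dest="$2"   # e.g. nutch/lib
    [ -d "$src" ] || { echo "missing source directory: $src" >&2; return 1; }
    mkdir -p "$dest" || return 1
    count=0
    for jar in "$src"/*.jar; do
        [ -e "$jar" ] || continue      # the glob matched nothing
        cp "$jar" "$dest"/ || return 1
        count=$((count + 1))
    done
    echo "copied $count jar(s) to $dest"
    [ "$count" -gt 0 ]                 # non-zero exit status if nothing was copied
}
```

Assuming the directory layout described in the steps above, it would be called as `copy_gora_jars gora/gora-hbase/lib-ext nutch/lib`.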
* Specify the GORA backend in nutch-site.xml
{{{
<property>
  <name>storage.data.store.class</name>
  <value>org.gora.hbase.store.HBaseStore</value>
  <description>Default class for storing data</description>
</property>
}}}
* Add a mapping file for HBase in conf/gora-hbase-mapping.xml
{{{
<?xml version="1.0" encoding="UTF-8"?>
<gora-orm>
  <table name="webtable">
    <family name="p"/> <!-- This can also have params like compression, bloom filters -->
    <family name="f"/>
    <family name="s"/>
    <family name="il"/>
    <family name="ol"/>
    <family name="h"/>
    <family name="mtdt"/>
    <family name="mk"/>
  </table>
  <class table="webtable" keyClass="java.lang.String" name="org.apache.nutch.storage.WebPage">
    <!-- fetch fields -->
    <field name="baseUrl" family="f" qualifier="bas"/>
    <field name="status" family="f" qualifier="st"/>
    <field name="prevFetchTime" family="f" qualifier="pts"/>
    <field name="fetchTime" family="f" qualifier="ts"/>
    <field name="fetchInterval" family="f" qualifier="fi"/>
    <field name="retriesSinceFetch" family="f" qualifier="rsf"/>
    <field name="reprUrl" family="f" qualifier="rpr"/>
    <field name="content" family="f" qualifier="cnt"/>
    <field name="contentType" family="f" qualifier="typ"/>
    <field name="protocolStatus" family="f" qualifier="prot"/>
    <field name="modifiedTime" family="f" qualifier="mod"/>
    <!-- parse fields -->
    <field name="title" family="p" qualifier="t"/>
    <field name="text" family="p" qualifier="c"/>
    <field name="parseStatus" family="p" qualifier="st"/>
    <field name="signature" family="p" qualifier="sig"/>
    <field name="prevSignature" family="p" qualifier="psig"/>
    <!-- score fields -->
    <field name="score" family="s" qualifier="s"/>
    <field name="headers" family="h"/>
    <field name="inlinks" family="il"/>
    <field name="outlinks" family="ol"/>
    <field name="metadata" family="mtdt"/>
    <field name="markers" family="mk"/>
  </class>
</gora-orm>
}}}
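A mapping file like the one above is easy to break by pointing a field at a column family that the table does not declare. The following is a rough sketch of a sanity check, not something GORA or Nutch provides: `undeclared_families` is a hypothetical name, and the grep/sed patterns assume each element sits on a single line with attributes written as in the example above.

```shell
#!/bin/sh
# Sketch only: undeclared_families is a hypothetical helper, not a GORA tool.
# Prints every family referenced by a <field> element that has no matching
# <family name="..."/> declaration in the same mapping file.
undeclared_families() {
    mapping="$1"
    # Families declared under <table>
    declared=$(grep -o '<family name="[^"]*"' "$mapping" \
        | sed 's/.*name="\([^"]*\)"/\1/' | sort -u)
    # Families referenced by <field ... family="..."> elements
    used=$(grep -o '<field [^>]*family="[^"]*"' "$mapping" \
        | sed 's/.*family="\([^"]*\)"/\1/' | sort -u)
    status=0
    for fam in $used; do
        if ! printf '%s\n' "$declared" | grep -qxF "$fam"; then
            echo "$fam"
            status=1
        fi
    done
    return $status   # 0 when every referenced family is declared
}
```

Run against the mapping file from the step above, for example `undeclared_families conf/gora-hbase-mapping.xml`. It prints nothing and exits with status 0 when every referenced family is declared.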
* Compile Nutch by running ''ant runtime''
* Make sure HBase is started and working properly
You should then be able to use it. Try going to ''$NUTCH_HOME/runtime/local/bin'' and running:
{{{
./nutch inject /someseedDir
./nutch readdb
}}}
You should find more details in the log file ''$NUTCH_HOME/runtime/local/logs/hadoop.log''.
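When a command fails, the relevant lines in hadoop.log are usually easier to find by filtering than by reading the whole file. The sketch below is one way to do that. `show_log_problems` is a hypothetical helper name, and the pattern assumes the log uses the usual log4j level names (ERROR, FATAL) and that Java stack traces contain the word "Exception".

```shell
#!/bin/sh
# Sketch only: show_log_problems is a hypothetical helper, not part of Nutch.
# Prints the last 20 lines of a log file that mention an error or exception,
# each prefixed with its line number in the file.
show_log_problems() {
    log="$1"
    [ -r "$log" ] || { echo "cannot read $log" >&2; return 2; }
    grep -nE 'ERROR|FATAL|Exception' "$log" | tail -n 20
}
```

For the setup described here, it could be called as `show_log_problems "$NUTCH_HOME/runtime/local/logs/hadoop.log"`.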