Posted to dev@nutch.apache.org by Apache Wiki <wi...@apache.org> on 2010/09/08 12:47:18 UTC

[Nutch Wiki] Update of "GORA_HBase" by JulienNioche

The "GORA_HBase" page has been changed by JulienNioche.
http://wiki.apache.org/nutch/GORA_HBase

--------------------------------------------------

New page:
This document describes how to get Nutch 2.0 to use HBase as a backend for GORA. It is based on revision 993857 of the Nutch trunk.

 * Install and configure HBase 0.20.6
 * Pull the GORA code and compile it
 * Copy the jars from gora/gora-hbase/lib-ext to nutch/lib
 * Add the following to nutch/ivy/ivy.xml

{{{
<dependency org="org.gora" name="gora-hbase" rev="0.1" conf="*->compile">
 <exclude org="com.sun.jdmk"/>
 <exclude org="com.sun.jmx"/>
 <exclude org="javax.jms"/>
</dependency>
}}}
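
The jar-copy step from the list above can be scripted. This is only a minimal sketch: it assumes GORA and Nutch are checked out side by side in directories named gora and nutch, as the relative paths above suggest, and the function name copy_gora_jars is made up for illustration.

```shell
#!/bin/sh
# Copy the extra jars shipped with gora-hbase into Nutch's lib directory.
# copy_gora_jars is a hypothetical helper; both arguments default to the
# relative paths used in the steps above (an assumption about the layout).
copy_gora_jars() {
  src="${1:-gora/gora-hbase/lib-ext}"
  dest="${2:-nutch/lib}"
  # Refuse to run if either directory is missing, rather than guessing.
  if [ ! -d "$src" ]; then
    echo "missing source directory: $src" >&2
    return 1
  fi
  if [ ! -d "$dest" ]; then
    echo "missing destination directory: $dest" >&2
    return 1
  fi
  cp "$src"/*.jar "$dest"/
}

# Usage, from the parent directory of gora/ and nutch/:
#   copy_gora_jars
```

Passing the two directories explicitly (copy_gora_jars /path/to/lib-ext /path/to/nutch/lib) avoids depending on the current working directory.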

 * Specify the GORA backend in nutch-site.xml

{{{
<property>
 <name>storage.data.store.class</name>
 <value>org.gora.hbase.store.HBaseStore</value>
 <description>Default class for storing data</description>
</property>
}}}

 * Add a mapping file for HBase in conf/gora-hbase-mapping.xml

{{{
<?xml version="1.0" encoding="UTF-8"?>
<gora-orm>
<table name="webtable">
  <family name="p"/> <!-- This can also have params like compression, bloom filters -->
  <family name="f"/>
  <family name="s"/>
  <family name="il"/>
  <family name="ol"/>
  <family name="h"/>
  <family name="mtdt"/>
  <family name="mk"/>
</table>
<class table="webtable" keyClass="java.lang.String" name="org.apache.nutch.storage.WebPage">
  <!-- fetch fields                                       -->
  <field name="baseUrl" family="f" qualifier="bas"/>
  <field name="status" family="f" qualifier="st"/>
  <field name="prevFetchTime" family="f" qualifier="pts"/>
  <field name="fetchTime" family="f" qualifier="ts"/>
  <field name="fetchInterval" family="f" qualifier="fi"/>
  <field name="retriesSinceFetch" family="f" qualifier="rsf"/>
  <field name="reprUrl" family="f" qualifier="rpr"/>
  <field name="content" family="f" qualifier="cnt"/>
  <field name="contentType" family="f" qualifier="typ"/>
  <field name="protocolStatus" family="f" qualifier="prot"/>
  <field name="modifiedTime" family="f" qualifier="mod"/>
  <!-- parse fields                                       -->
  <field name="title" family="p" qualifier="t"/>
  <field name="text" family="p" qualifier="c"/>
  <field name="parseStatus" family="p" qualifier="st"/>
  <field name="signature" family="p" qualifier="sig"/>
  <field name="prevSignature" family="p" qualifier="psig"/>
  <!-- score fields                                       -->
  <field name="score" family="s" qualifier="s"/>
  <field name="headers" family="h"/>
  <field name="inlinks" family="il"/>
  <field name="outlinks" family="ol"/>
  <field name="metadata" family="mtdt"/>
  <field name="markers" family="mk"/>
</class>
</gora-orm>
}}}
 * Compile Nutch by running ''ant runtime''
 * Make sure HBase is started and working properly
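
One quick way to check that HBase is up is to send the status command to the HBase shell. The sketch below assumes the hbase launcher script is on the PATH; hbase_status is a hypothetical helper name.

```shell
#!/bin/sh
# Pipe the "status" command into the HBase shell.
# If HBase is running, the shell should print a short summary of the
# servers it can see; connection errors suggest it is not reachable.
# hbase_status is a hypothetical name; "hbase" is assumed to be on PATH.
hbase_status() {
  echo "status" | hbase shell
}

# Usage:
#   hbase_status
```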

You should then be able to use it. Go to ''$NUTCH_HOME/runtime/local/bin'' and run:

{{{
  ./nutch inject /someseedDir
  ./nutch readdb
}}}
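
To confirm that the injected seeds actually reached HBase, the table named in the mapping file (webtable) can be inspected from the HBase shell. This is a sketch only: it assumes the hbase launcher is on the PATH and that the table exists after the inject step; peek_webtable is a made-up helper name.

```shell
#!/bin/sh
# Print the first few rows of the "webtable" table through the HBase shell.
# The table name comes from conf/gora-hbase-mapping.xml above.
# peek_webtable is a hypothetical helper; the optional argument is the
# number of rows to show (default 1).
peek_webtable() {
  echo "scan 'webtable', {LIMIT => ${1:-1}}" | hbase shell
}

# Usage:
#   peek_webtable      # show one row
#   peek_webtable 5    # show five rows
```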

You should find more details in the log at ''$NUTCH_HOME/runtime/local/logs/hadoop.log''.
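
When something goes wrong, it can help to pull only the suspicious lines out of that log. A minimal sketch, assuming NUTCH_HOME is set in the environment; recent_log_errors is a hypothetical helper name and the search pattern is just a reasonable starting point.

```shell
#!/bin/sh
# Show the most recent error-like lines from the Nutch log.
# recent_log_errors is a hypothetical helper.
#   $1: log file (defaults to the hadoop.log path mentioned above)
#   $2: maximum number of lines to show (default 20)
recent_log_errors() {
  log="${1:-$NUTCH_HOME/runtime/local/logs/hadoop.log}"
  grep -E 'ERROR|FATAL|Exception' "$log" | tail -n "${2:-20}"
}

# Usage:
#   recent_log_errors
#   recent_log_errors /path/to/hadoop.log 50
```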