Posted to dev@nutch.apache.org by David Stuart <da...@progressivealliance.co.uk> on 2010/09/02 13:58:18 UTC

Nutch 2.0 Help

Hey All,

I have set up the latest version of Nutch from trunk and am running into a few issues with HBase and injecting URLs. When I run the command

runtime/local/bin/nutch inject runtime/local/seed/

I get 
InjectorJob: java.lang.RuntimeException: Could not create datastore
        at org.apache.nutch.storage.StorageUtils.initMapperJob(StorageUtils.java:70)
        at org.apache.nutch.storage.StorageUtils.initMapperJob(StorageUtils.java:50)
        at org.apache.nutch.crawl.InjectorJob.inject(InjectorJob.java:233)
        at org.apache.nutch.crawl.InjectorJob.run(InjectorJob.java:246)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
        at org.apache.nutch.crawl.InjectorJob.main(InjectorJob.java:256)

Under the gora properties it should be pointing at localhost/nutchtest, and I created that store manually in HBase. Is that right?

I have found a few tutorials around nutchbase, but the API seems to have changed since the merge with Nutch trunk.

Any help would be appreciated, and I will try to do a how-to writeup.

Regards,

Dave

Re: Nutch 2.0 Help

Posted by Enis Soztutar <en...@gmail.com>.
Hi,

I think we need to commit all the necessary files to Nutch so that it can
work out of the box for SQL, HBase and Cassandra. We can even write
commented-out entries in gora.properties, nutch-site.xml, etc. so that using
Nutch with different backends becomes a configuration change. I will open an
issue to track this.

Cheers,
Enis

On Wed, Sep 8, 2010 at 1:53 PM, Julien Nioche <lists.digitalpebble@gmail.com> wrote:

> Hi guys,
>
> I've summarized the steps to follow for having GORA+Hbase with Nutch 2.0 on
> http://wiki.apache.org/nutch/GORA_HBase
>
> Feel free to amend and improve as you see fit.
>
> Please bear in mind that Nutch 2.0 is at a very early stage and is far from
> being bug-proof, see in particular [1].
>
> HTH
>
> Julien
>
> [1] https://issues.apache.org/jira/browse/NUTCH-893
>
> --
> Open Source Solutions for Text Engineering
>
> http://digitalpebble.blogspot.com/
> http://www.digitalpebble.com
>
>
> On 6 September 2010 13:35, Andrzej Bialecki <ab...@getopt.org> wrote:
>
> > On 2010-09-05 14:56, David Stuart wrote:
> >
> >> Hi All,
> >>
> >> I have done as per below and can create a table from within the hbase
> >> shell. I found the appropriate create table method
> >> bin/nutch org.apache.nutch.storage.WebTableCreator webtable but it only
> >> returns null
> >>
> >> Any help would be great
> >>
> >
> > You don't have to create a table manually - this should happen
> > automatically when you first run any Nutch tool. Just make sure you have
> > hbase-site.xml on your classpath in Nutch - best if you put it in your
> > conf/ and rebuild, so that it's packed into a job jar.
> >
> > Here's for example my config files that work with HBase (I don't use any
> > non-standard settings for HBase, so my hbase-site.xml has no properties,
> > but still it needs to be included in Nutch job jar):
> >
> > gora-hbase-mapping.xml:
> > -------------------------------------------------------------------------
> >
> > <gora-orm>
> >
> > <table name="webtable">
> >  <family name="p"/> <!-- This can also have params like compression, bloom filters -->
> >  <family name="f"/>
> >  <family name="s"/>
> >  <family name="il"/>
> >  <family name="ol"/>
> >  <family name="h"/>
> >  <family name="mtdt"/>
> >  <family name="mk"/>
> > </table>
> >
> > <class table="webtable" keyClass="java.lang.String"
> > name="org.apache.nutch.storage.WebPage">
> >  <!-- fetch fields                                       -->
> >  <field name="baseUrl" family="f" qualifier="bas"/>
> >  <field name="status" family="f" qualifier="st"/>
> >  <field name="prevFetchTime" family="f" qualifier="pts"/>
> >  <field name="fetchTime" family="f" qualifier="ts"/>
> >  <field name="fetchInterval" family="f" qualifier="fi"/>
> >  <field name="retriesSinceFetch" family="f" qualifier="rsf"/>
> >  <field name="reprUrl" family="f" qualifier="rpr"/>
> >  <field name="content" family="f" qualifier="cnt"/>
> >  <field name="contentType" family="f" qualifier="typ"/>
> >  <field name="protocolStatus" family="f" qualifier="prot"/>
> >  <field name="modifiedTime" family="f" qualifier="mod"/>
> >
> >  <!-- parse fields                                       -->
> >  <field name="title" family="p" qualifier="t"/>
> >  <field name="text" family="p" qualifier="c"/>
> >  <field name="parseStatus" family="p" qualifier="st"/>
> >  <field name="signature" family="p" qualifier="sig"/>
> >  <field name="prevSignature" family="p" qualifier="psig"/>
> >
> >  <!-- score fields                                       -->
> >  <field name="score" family="s" qualifier="s"/>
> >
> >  <field name="headers" family="h"/>
> >
> >  <field name="inlinks" family="il"/>
> >
> >  <field name="outlinks" family="ol"/>
> >
> >  <field name="metadata" family="mtdt"/>
> >
> >  <field name="markers" family="mk"/>
> >
> > </class>
> >
> > </gora-orm>
> > -------------------------------------------------------------------------
> >
> > nutch-site.xml:
> > -------------------------------------------------------------------------
> > ... blah blah, a lot of unrelated stuff...
> >
> > <property>
> >  <name>storage.data.store.class</name>
> >  <value>org.gora.hbase.store.HBaseStore</value>
> >
> >  <description>Default class for storing data</description>
> > </property>
> > -------------------------------------------------------------------------
> >
> > Of course you need also to use the same hadoop files (hdfs-site and
> > mapred-site) as the ones that HBase uses.
> >
> >
> > --
> > Best regards,
> > Andrzej Bialecki     <><
> >  ___. ___ ___ ___ _ _   __________________________________
> > [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
> > ___|||__||  \|  ||  |  Embedded Unix, System Integration
> > http://www.sigram.com  Contact: info at sigram dot com
> >
>

Re: Nutch 2.0 Help

Posted by Julien Nioche <li...@gmail.com>.
Hi guys,

I've summarized the steps to follow for having GORA+Hbase with Nutch 2.0 on
http://wiki.apache.org/nutch/GORA_HBase

Feel free to amend and improve as you see fit.

Please bear in mind that Nutch 2.0 is at a very early stage and is far from
being bug-proof, see in particular [1].

HTH

Julien

[1] https://issues.apache.org/jira/browse/NUTCH-893

-- 
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com


On 6 September 2010 13:35, Andrzej Bialecki <ab...@getopt.org> wrote:

> On 2010-09-05 14:56, David Stuart wrote:
>
>> Hi All,
>>
>> I have done as per below and can create a table from within the hbase
>> shell. I found the appropriate create table method
>> bin/nutch org.apache.nutch.storage.WebTableCreator webtable but it only
>> returns null
>>
>> Any help would be great
>>
>
> You don't have to create a table manually - this should happen
> automatically when you first run any Nutch tool. Just make sure you have
> hbase-site.xml on your classpath in Nutch - best if you put it in your conf/
> and rebuild, so that it's packed into a job jar.
>
> Here's for example my config files that work with HBase (I don't use any
> non-standard settings for HBase, so my hbase-site.xml has no properties, but
> still it needs to be included in Nutch job jar):
>
> gora-hbase-mapping.xml:
> -------------------------------------------------------------------------
>
> <gora-orm>
>
> <table name="webtable">
>  <family name="p"/> <!-- This can also have params like compression, bloom filters -->
>  <family name="f"/>
>  <family name="s"/>
>  <family name="il"/>
>  <family name="ol"/>
>  <family name="h"/>
>  <family name="mtdt"/>
>  <family name="mk"/>
> </table>
>
> <class table="webtable" keyClass="java.lang.String"
> name="org.apache.nutch.storage.WebPage">
>  <!-- fetch fields                                       -->
>  <field name="baseUrl" family="f" qualifier="bas"/>
>  <field name="status" family="f" qualifier="st"/>
>  <field name="prevFetchTime" family="f" qualifier="pts"/>
>  <field name="fetchTime" family="f" qualifier="ts"/>
>  <field name="fetchInterval" family="f" qualifier="fi"/>
>  <field name="retriesSinceFetch" family="f" qualifier="rsf"/>
>  <field name="reprUrl" family="f" qualifier="rpr"/>
>  <field name="content" family="f" qualifier="cnt"/>
>  <field name="contentType" family="f" qualifier="typ"/>
>  <field name="protocolStatus" family="f" qualifier="prot"/>
>  <field name="modifiedTime" family="f" qualifier="mod"/>
>
>  <!-- parse fields                                       -->
>  <field name="title" family="p" qualifier="t"/>
>  <field name="text" family="p" qualifier="c"/>
>  <field name="parseStatus" family="p" qualifier="st"/>
>  <field name="signature" family="p" qualifier="sig"/>
>  <field name="prevSignature" family="p" qualifier="psig"/>
>
>  <!-- score fields                                       -->
>  <field name="score" family="s" qualifier="s"/>
>
>  <field name="headers" family="h"/>
>
>  <field name="inlinks" family="il"/>
>
>  <field name="outlinks" family="ol"/>
>
>  <field name="metadata" family="mtdt"/>
>
>  <field name="markers" family="mk"/>
>
> </class>
>
> </gora-orm>
> -------------------------------------------------------------------------
>
> nutch-site.xml:
> -------------------------------------------------------------------------
> ... blah blah, a lot of unrelated stuff...
>
> <property>
>  <name>storage.data.store.class</name>
>  <value>org.gora.hbase.store.HBaseStore</value>
>
>  <description>Default class for storing data</description>
> </property>
> -------------------------------------------------------------------------
>
> Of course you need also to use the same hadoop files (hdfs-site and
> mapred-site) as the ones that HBase uses.
>
>
> --
> Best regards,
> Andrzej Bialecki     <><
>  ___. ___ ___ ___ _ _   __________________________________
> [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
> ___|||__||  \|  ||  |  Embedded Unix, System Integration
> http://www.sigram.com  Contact: info at sigram dot com
>

Re: Nutch 2.0 Help

Posted by Andrzej Bialecki <ab...@getopt.org>.
On 2010-09-05 14:56, David Stuart wrote:
> Hi All,
>
> I have done as per below and can create a table from within the hbase
> shell. I found the appropriate create table method
> bin/nutch org.apache.nutch.storage.WebTableCreator webtable but it only
> returns null
>
> Any help would be great

You don't have to create a table manually - this should happen 
automatically when you first run any Nutch tool. Just make sure you have 
hbase-site.xml on your classpath in Nutch - best if you put it in your 
conf/ and rebuild, so that it's packed into a job jar.
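
One way to check the classpath point above is a small lookup like the following. This is only a sketch, not part of Nutch: the class name ClasspathCheck and its helper methods are made up for illustration, and the file names are simply the ones mentioned in this thread. It asks the class loader where each file would be loaded from.

```java
import java.net.URL;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not a Nutch class: reports which config files are visible on the classpath.
class ClasspathCheck {

    /** Returns the URL a resource would be loaded from, or null if it is not on the classpath. */
    static URL locate(String resourceName) {
        ClassLoader loader = Thread.currentThread().getContextClassLoader();
        if (loader == null) {
            loader = ClassLoader.getSystemClassLoader();
        }
        return loader.getResource(resourceName);
    }

    /** Looks up several resources, keeping the order in which they were given. */
    static Map<String, URL> locateAll(String... resourceNames) {
        Map<String, URL> found = new LinkedHashMap<>();
        for (String name : resourceNames) {
            found.put(name, locate(name));
        }
        return found;
    }

    public static void main(String[] args) {
        // File names taken from this thread; adjust the list to your setup.
        Map<String, URL> results = locateAll(
                "hbase-site.xml", "gora-hbase-mapping.xml", "gora.properties", "nutch-site.xml");
        for (Map.Entry<String, URL> entry : results.entrySet()) {
            URL url = entry.getValue();
            System.out.println(entry.getKey() + " -> " + (url == null ? "NOT FOUND" : url));
        }
    }
}
```

For the result to mean anything, it has to be run with the same classpath that the Nutch tools use; a file reported as NOT FOUND there is not visible to Nutch either.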

Here are, for example, my config files that work with HBase (I don't use any
non-standard settings for HBase, so my hbase-site.xml has no properties,
but it still needs to be included in the Nutch job jar):

gora-hbase-mapping.xml:
-------------------------------------------------------------------------
<gora-orm>

<table name="webtable">
   <family name="p"/> <!-- This can also have params like compression, bloom filters -->
   <family name="f"/>
   <family name="s"/>
   <family name="il"/>
   <family name="ol"/>
   <family name="h"/>
   <family name="mtdt"/>
   <family name="mk"/>
</table>

<class table="webtable" keyClass="java.lang.String" 
name="org.apache.nutch.storage.WebPage">
   <!-- fetch fields                                       -->
   <field name="baseUrl" family="f" qualifier="bas"/>
   <field name="status" family="f" qualifier="st"/>
   <field name="prevFetchTime" family="f" qualifier="pts"/>
   <field name="fetchTime" family="f" qualifier="ts"/>
   <field name="fetchInterval" family="f" qualifier="fi"/>
   <field name="retriesSinceFetch" family="f" qualifier="rsf"/>
   <field name="reprUrl" family="f" qualifier="rpr"/>
   <field name="content" family="f" qualifier="cnt"/>
   <field name="contentType" family="f" qualifier="typ"/>
   <field name="protocolStatus" family="f" qualifier="prot"/>
   <field name="modifiedTime" family="f" qualifier="mod"/>

   <!-- parse fields                                       -->
   <field name="title" family="p" qualifier="t"/>
   <field name="text" family="p" qualifier="c"/>
   <field name="parseStatus" family="p" qualifier="st"/>
   <field name="signature" family="p" qualifier="sig"/>
   <field name="prevSignature" family="p" qualifier="psig"/>

   <!-- score fields                                       -->
   <field name="score" family="s" qualifier="s"/>

   <field name="headers" family="h"/>

   <field name="inlinks" family="il"/>

   <field name="outlinks" family="ol"/>

   <field name="metadata" family="mtdt"/>

   <field name="markers" family="mk"/>

</class>

</gora-orm>
-------------------------------------------------------------------------

nutch-site.xml:
-------------------------------------------------------------------------
... blah blah, a lot of unrelated stuff...
<property>
   <name>storage.data.store.class</name>
   <value>org.gora.hbase.store.HBaseStore</value>
   <description>Default class for storing data</description>
</property>
-------------------------------------------------------------------------

Of course, you also need to use the same Hadoop files (hdfs-site and
mapred-site) as the ones that HBase uses.


-- 
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com


Re: Nutch 2.0 Help

Posted by David Stuart <da...@progressivealliance.co.uk>.
Hi All,

I have done as per below and can create a table from within the HBase shell. I found what looks like the appropriate create-table tool,
 bin/nutch org.apache.nutch.storage.WebTableCreator webtable
but it only returns null.

Any help would be great

Regards

Dave


On 2 Sep 2010, at 13:12, Julien Nioche wrote:

> Hi David,
> 
> I haven't used the Hbase backend with GORA for quite some time but from what I can remember you'll need the following things :
> 
> * conf/hbase-site.xml => this should correspond to your local configuration
> * conf/gora-hbase-mapping.xml => see below
> * conf/gora.properties => don't think there anything you need to specify for Hbase
> 
> * in nutch-site.xml
> 
> <property>
>   <name>storage.data.store.class</name>
>   <value>org.gora.hbase.store.HBaseStore</value>
>   <description>Default class for storing data</description>
> </property>
> 
> and of course all the necessary Hbase jars in the /lib dir - probably easier to modify ivy/ivy.xml so that it includes Hbase
> 
> gora-hbase-mapping.xml  : not sure this is the latest version though 
> 
> <?xml version="1.0" encoding="UTF-8"?>
> 
> <gora-orm>
> 
> <table name="webtable">
>   <family name="p"/> <!-- This can also have params like compression, bloom filters -->
>   <family name="f"/>
>   <family name="s"/>
>   <family name="il"/>
>   <family name="ol"/>
>   <family name="h"/>
>   <family name="mtdt"/>
>   <family name="mk"/>
> </table>
> 
> <class table="webtable" keyClass="java.lang.String" name="org.apache.nutch.storage.WebPage">  
>   <!-- fetch fields                                       -->
>   <field name="baseUrl" family="f" qualifier="bas"/>    
>   <field name="status" family="f" qualifier="st"/>
>   <field name="prevFetchTime" family="f" qualifier="pts"/>
>   <field name="fetchTime" family="f" qualifier="ts"/>
>   <field name="fetchInterval" family="f" qualifier="fi"/>
>   <field name="retriesSinceFetch" family="f" qualifier="rsf"/>
>   <field name="reprUrl" family="f" qualifier="rpr"/>
>   <field name="content" family="f" qualifier="cnt"/>
>   <field name="contentType" family="f" qualifier="typ"/>    
>   <field name="protocolStatus" family="f" qualifier="prot"/>
>   <field name="modifiedTime" family="f" qualifier="mod"/>
> 
>   <!-- parse fields                                       -->
>   <field name="title" family="p" qualifier="t"/>
>   <field name="text" family="p" qualifier="c"/>
>   <field name="parseStatus" family="p" qualifier="st"/>
>   <field name="signature" family="p" qualifier="sig"/>
>   <field name="prevSignature" family="p" qualifier="psig"/>
> 
>   <!-- score fields                                       -->
>   <field name="score" family="s" qualifier="s"/>
> 
>   <field name="headers" family="h"/>
> 
>   <field name="inlinks" family="il"/>
> 
>   <field name="outlinks" family="ol"/>
> 
>   <field name="metadata" family="mtdt"/>
> 
>   <field name="markers" family="mk"/>
> 
> </class>
> 
> </gora-orm>
> 
> 
> HTH
> 
> Good luck!
> 
> Julien
> 
> -- 
> 
> Open Source Solutions for Text Engineering
> 
> http://digitalpebble.blogspot.com/
> http://www.digitalpebble.com
> 
> On 2 September 2010 12:58, David Stuart <da...@progressivealliance.co.uk> wrote:
> Hey All,
> 
> I have setup the latest version nutch from trunk and am running into a few issues with hbase and injecting urls. when I run the command
> 
> runtime/local/bin/nutch inject runtime/local/seed/
> 
> I get
> InjectorJob: java.lang.RuntimeException: Could not create datastore
>        at org.apache.nutch.storage.StorageUtils.initMapperJob(StorageUtils.java:70)
>        at org.apache.nutch.storage.StorageUtils.initMapperJob(StorageUtils.java:50)
>        at org.apache.nutch.crawl.InjectorJob.inject(InjectorJob.java:233)
>        at org.apache.nutch.crawl.InjectorJob.run(InjectorJob.java:246)
>        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>        at org.apache.nutch.crawl.InjectorJob.main(InjectorJob.java:256)
> 
> Under the gora properties it should be pointing at localhost/nutchtest and I created that store manually in hbase is that right? 
> 
> I have found a few tutorials around nutchbase but the api seems to have changed since the merge with Nutch trunk
> 
> Any help would be appreciated and I try to do a how to writeup
> 
> Regards,
> 
> Dave
> 


Re: Nutch 2.0 Help

Posted by Julien Nioche <li...@gmail.com>.
Hi David,

I haven't used the HBase backend with GORA for quite some time, but from what
I can remember you'll need the following things:

* conf/hbase-site.xml => this should correspond to your local configuration
* conf/gora-hbase-mapping.xml => see below
* conf/gora.properties => I don't think there is anything you need to specify
  for HBase

* in nutch-site.xml

<property>
  <name>storage.data.store.class</name>
  <value>org.gora.hbase.store.HBaseStore</value>
  <description>Default class for storing data</description>
</property>

and of course all the necessary HBase jars in the /lib dir. It is probably
easier to modify ivy/ivy.xml so that it includes HBase.
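
Java class names are case-sensitive, so the value of storage.data.store.class has to match the class name exactly (elsewhere in this thread it is spelled org.gora.hbase.store.HBaseStore), and the jar that contains it has to be on the classpath. The sketch below is a hypothetical helper, not part of Nutch or GORA, that checks whether a class name can be resolved. Whether an unresolvable store class is what produces "Could not create datastore" is an assumption; the stack trace in this thread does not confirm it.

```java
// Hypothetical helper, not a Nutch or GORA class: checks whether a class name resolves.
class StoreClassCheck {

    /** Returns true if the named class can be found by the current class loader. */
    static boolean isLoadable(String className) {
        ClassLoader loader = Thread.currentThread().getContextClassLoader();
        if (loader == null) {
            loader = ClassLoader.getSystemClassLoader();
        }
        try {
            // initialize=false: only locate the class, do not run its static initializers.
            Class.forName(className, false, loader);
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Default: the store class name as spelled elsewhere in this thread.
        String[] names = args.length > 0 ? args : new String[] {"org.gora.hbase.store.HBaseStore"};
        for (String name : names) {
            System.out.println(name + " -> " + (isLoadable(name) ? "loadable" : "NOT loadable"));
        }
    }
}
```

Passing both spellings as arguments, with the Nutch lib jars on the classpath, shows which of them actually resolves.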

gora-hbase-mapping.xml (not sure this is the latest version, though):

<?xml version="1.0" encoding="UTF-8"?>

<gora-orm>

<table name="webtable">
  <family name="p"/> <!-- This can also have params like compression, bloom filters -->
  <family name="f"/>
  <family name="s"/>
  <family name="il"/>
  <family name="ol"/>
  <family name="h"/>
  <family name="mtdt"/>
  <family name="mk"/>
</table>

<class table="webtable" keyClass="java.lang.String" name="org.apache.nutch.storage.WebPage">
  <!-- fetch fields                                       -->
  <field name="baseUrl" family="f" qualifier="bas"/>
  <field name="status" family="f" qualifier="st"/>
  <field name="prevFetchTime" family="f" qualifier="pts"/>
  <field name="fetchTime" family="f" qualifier="ts"/>
  <field name="fetchInterval" family="f" qualifier="fi"/>
  <field name="retriesSinceFetch" family="f" qualifier="rsf"/>
  <field name="reprUrl" family="f" qualifier="rpr"/>
  <field name="content" family="f" qualifier="cnt"/>
  <field name="contentType" family="f" qualifier="typ"/>
  <field name="protocolStatus" family="f" qualifier="prot"/>
  <field name="modifiedTime" family="f" qualifier="mod"/>

  <!-- parse fields                                       -->
  <field name="title" family="p" qualifier="t"/>
  <field name="text" family="p" qualifier="c"/>
  <field name="parseStatus" family="p" qualifier="st"/>
  <field name="signature" family="p" qualifier="sig"/>
  <field name="prevSignature" family="p" qualifier="psig"/>

  <!-- score fields                                       -->
  <field name="score" family="s" qualifier="s"/>

  <field name="headers" family="h"/>

  <field name="inlinks" family="il"/>

  <field name="outlinks" family="ol"/>

  <field name="metadata" family="mtdt"/>

  <field name="markers" family="mk"/>

</class>

</gora-orm>


HTH

Good luck!

Julien

-- 
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com

On 2 September 2010 12:58, David Stuart <
david.stuart@progressivealliance.co.uk> wrote:

> Hey All,
>
> I have setup the latest version nutch from trunk and am running into a few
> issues with hbase and injecting urls. when I run the command
>
> runtime/local/bin/nutch inject runtime/local/seed/
>
> I get
> InjectorJob: java.lang.RuntimeException: Could not create datastore
>        at org.apache.nutch.storage.StorageUtils.initMapperJob(StorageUtils.java:70)
>        at org.apache.nutch.storage.StorageUtils.initMapperJob(StorageUtils.java:50)
>        at org.apache.nutch.crawl.InjectorJob.inject(InjectorJob.java:233)
>        at org.apache.nutch.crawl.InjectorJob.run(InjectorJob.java:246)
>        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>        at org.apache.nutch.crawl.InjectorJob.main(InjectorJob.java:256)
>
> Under the gora properties it should be pointing at localhost/nutchtest and
> I created that store manually in hbase is that right?
>
> I have found a few tutorials around nutchbase but the api seems to have
> changed since the merge with Nutch trunk
>
> Any help would be appreciated and I try to do a how to writeup
>
> Regards,
>
> Dave