Posted to dev@nutch.apache.org by Alban Mouton <al...@gmail.com> on 2009/12/05 15:56:49 UTC

State of nutchbase

Hello,

I have looked a little into the Nutch code and mailing lists. I think the
nutchbase branch (http://issues.apache.org/jira/browse/NUTCH-650) is very
interesting, with good potential to improve code clarity and flexibility
(I find the data structures quite obscure in the current version). The issue
has been untouched since last August, so my question is: can nutchbase really
be part of Nutch 1.1? Is there still much work to do, or is it almost ready?
Is it a worthwhile issue for an interested developer with a (still!) limited
knowledge of the project?

So far I have only tried to run nutchbase in Eclipse by following the
tutorial (http://wiki.apache.org/nutch/RunNutchInEclipse1.0), but I run into
errors when building, mostly from Parser and tests. I may start by cleaning
this up.

Eclipse build errors:

Description    Resource    Path    Location    Type
FetcherOutputFormat cannot be resolved to a type
ArcSegmentCreator.java    /nutchbase/src/java/org/apache/nutch/tools/arc
line 362    Java Problem
Generator.GENERATE_MAX_PER_HOST_BY_IP cannot be resolved
TestGenerator.java    /nutchbase/src/test/org/apache/nutch/crawl    line
202    Java Problem
ParseImpl cannot be resolved to a type    ArcSegmentCreator.java
/nutchbase/src/java/org/apache/nutch/tools/arc    line 229    Java Problem
ParseImpl cannot be resolved to a type    BasicFields.java
/nutchbase/src/java/org/apache/nutch/indexer/field    line 335    Java
Problem
ParseImpl cannot be resolved to a type    ExtParser.java
/nutchbase/src/plugin/parse-ext/src/java/org/apache/nutch/parse/ext    line
138    Java Problem
ParseImpl cannot be resolved to a type    MSBaseParser.java
/nutchbase/src/plugin/lib-parsems/src/java/org/apache/nutch/parse/ms    line
108    Java Problem
ParseImpl cannot be resolved to a type    OOParser.java
/nutchbase/src/plugin/parse-oo/src/java/org/apache/nutch/parse/oo    line
103    Java Problem
ParseImpl cannot be resolved to a type    PdfParser.java
/nutchbase/src/plugin/parse-pdf/src/java/org/apache/nutch/parse/pdf    line
155    Java Problem
ParseImpl cannot be resolved to a type    RSSParser.java
/nutchbase/src/plugin/parse-rss/src/java/org/apache/nutch/parse/rss    line
187    Java Problem
ParseImpl cannot be resolved to a type    SWFParser.java
/nutchbase/src/plugin/parse-swf/src/java/org/apache/nutch/parse/swf    line
113    Java Problem
ParseImpl cannot be resolved to a type    TestIndexingFilters.java
/nutchbase/src/test/org/apache/nutch/indexer    line 45    Java Problem
ParseImpl cannot be resolved to a type    TestMoreIndexingFilter.java
/nutchbase/src/plugin/index-more/src/test/org/apache/nutch/indexer/more
line 61    Java Problem
ParseImpl cannot be resolved to a type    TextParser.java
/nutchbase/src/plugin/parse-text/src/java/org/apache/nutch/parse/text
line 55    Java Problem
ParseImpl cannot be resolved to a type    ZipParser.java
/nutchbase/src/plugin/parse-zip/src/java/org/apache/nutch/parse/zip    line
105    Java Problem
ParseResult cannot be resolved    ExtParser.java
/nutchbase/src/plugin/parse-ext/src/java/org/apache/nutch/parse/ext    line
137    Java Problem
ParseResult cannot be resolved    MSBaseParser.java
/nutchbase/src/plugin/lib-parsems/src/java/org/apache/nutch/parse/ms    line
107    Java Problem
ParseResult cannot be resolved    OOParser.java
/nutchbase/src/plugin/parse-oo/src/java/org/apache/nutch/parse/oo    line
103    Java Problem
ParseResult cannot be resolved    PdfParser.java
/nutchbase/src/plugin/parse-pdf/src/java/org/apache/nutch/parse/pdf    line
155    Java Problem
ParseResult cannot be resolved    RSSParser.java
/nutchbase/src/plugin/parse-rss/src/java/org/apache/nutch/parse/rss    line
187    Java Problem
ParseResult cannot be resolved    SWFParser.java
/nutchbase/src/plugin/parse-swf/src/java/org/apache/nutch/parse/swf    line
113    Java Problem
ParseResult cannot be resolved    TextParser.java
/nutchbase/src/plugin/parse-text/src/java/org/apache/nutch/parse/text
line 55    Java Problem
ParseResult cannot be resolved    ZipParser.java
/nutchbase/src/plugin/parse-zip/src/java/org/apache/nutch/parse/zip    line
105    Java Problem
ParseResult cannot be resolved to a type    ArcSegmentCreator.java
/nutchbase/src/java/org/apache/nutch/tools/arc    line 159    Java Problem
ParseResult cannot be resolved to a type    CCParseFilter.java
/nutchbase/src/plugin/creativecommons/src/java/org/creativecommons/nutch
line 267    Java Problem
ParseResult cannot be resolved to a type    CCParseFilter.java
/nutchbase/src/plugin/creativecommons/src/java/org/creativecommons/nutch
line 267    Java Problem
ParseResult cannot be resolved to a type    ExtParser.java
/nutchbase/src/plugin/parse-ext/src/java/org/apache/nutch/parse/ext    line
69    Java Problem
ParseResult cannot be resolved to a type    FeedParser.java
/nutchbase/src/plugin/feed/src/java/org/apache/nutch/parse/feed    line
106    Java Problem
ParseResult cannot be resolved to a type    FeedParser.java
/nutchbase/src/plugin/feed/src/java/org/apache/nutch/parse/feed    line
108    Java Problem
ParseResult cannot be resolved to a type    FeedParser.java
/nutchbase/src/plugin/feed/src/java/org/apache/nutch/parse/feed    line
108    Java Problem
ParseResult cannot be resolved to a type    FeedParser.java
/nutchbase/src/plugin/feed/src/java/org/apache/nutch/parse/feed    line
211    Java Problem
ParseResult cannot be resolved to a type    FeedParser.java
/nutchbase/src/plugin/feed/src/java/org/apache/nutch/parse/feed    line
221    Java Problem
ParseResult cannot be resolved to a type    HTMLLanguageParser.java
/nutchbase/src/plugin/languageidentifier/src/java/org/apache/nutch/analysis/lang
line 90    Java Problem
ParseResult cannot be resolved to a type    HTMLLanguageParser.java
/nutchbase/src/plugin/languageidentifier/src/java/org/apache/nutch/analysis/lang
line 90    Java Problem
ParseResult cannot be resolved to a type    MSBaseParser.java
/nutchbase/src/plugin/lib-parsems/src/java/org/apache/nutch/parse/ms    line
64    Java Problem
ParseResult cannot be resolved to a type    MSExcelParser.java
/nutchbase/src/plugin/parse-msexcel/src/java/org/apache/nutch/parse/msexcel
line 40    Java Problem
ParseResult cannot be resolved to a type    MSPowerPointParser.java
/nutchbase/src/plugin/parse-mspowerpoint/src/java/org/apache/nutch/parse/mspowerpoint
line 44    Java Problem
ParseResult cannot be resolved to a type    MSWordParser.java
/nutchbase/src/plugin/parse-msword/src/java/org/apache/nutch/parse/msword
line 43    Java Problem
ParseResult cannot be resolved to a type    OOParser.java
/nutchbase/src/plugin/parse-oo/src/java/org/apache/nutch/parse/oo    line
63    Java Problem
ParseResult cannot be resolved to a type    PdfParser.java
/nutchbase/src/plugin/parse-pdf/src/java/org/apache/nutch/parse/pdf    line
69    Java Problem
ParseResult cannot be resolved to a type    RSSParser.java
/nutchbase/src/plugin/parse-rss/src/java/org/apache/nutch/parse/rss    line
80    Java Problem
ParseResult cannot be resolved to a type    RelTagParser.java
/nutchbase/src/plugin/microformats-reltag/src/java/org/apache/nutch/microformats/reltag
line 68    Java Problem
ParseResult cannot be resolved to a type    RelTagParser.java
/nutchbase/src/plugin/microformats-reltag/src/java/org/apache/nutch/microformats/reltag
line 68    Java Problem
ParseResult cannot be resolved to a type    SWFParser.java
/nutchbase/src/plugin/parse-swf/src/java/org/apache/nutch/parse/swf    line
64    Java Problem
ParseResult cannot be resolved to a type    SWFParser.java
/nutchbase/src/plugin/parse-swf/src/java/org/apache/nutch/parse/swf    line
125    Java Problem
ParseResult cannot be resolved to a type    TestFeedParser.java
/nutchbase/src/plugin/feed/src/test/org/apache/nutch/parse/feed    line
94    Java Problem
ParseResult cannot be resolved to a type    TextParser.java
/nutchbase/src/plugin/parse-text/src/java/org/apache/nutch/parse/text
line 41    Java Problem
ParseResult cannot be resolved to a type    ZipParser.java
/nutchbase/src/plugin/parse-zip/src/java/org/apache/nutch/parse/zip    line
55    Java Problem
The constructor Fetcher(Configuration) is undefined    TestFetcher.java
/nutchbase/src/test/org/apache/nutch/fetcher    line 100    Java Problem
The constructor Fetcher(Configuration) is undefined    TestFetcher.java
/nutchbase/src/test/org/apache/nutch/fetcher    line 177    Java Problem
The constructor Generator(Configuration) is undefined    TestFetcher.java
/nutchbase/src/test/org/apache/nutch/fetcher    line 94    Java Problem
The constructor Generator(Configuration) is undefined
TestGenerator.java    /nutchbase/src/test/org/apache/nutch/crawl    line
312    Java Problem
The constructor Injector(Configuration) is undefined    TestFetcher.java
/nutchbase/src/test/org/apache/nutch/fetcher    line 90    Java Problem
The constructor Injector(Configuration) is undefined    TestInjector.java
/nutchbase/src/test/org/apache/nutch/crawl    line 70    Java Problem
The constructor NutchWritable(ParseImpl) is undefined
ArcSegmentCreator.java    /nutchbase/src/java/org/apache/nutch/tools/arc
line 229    Java Problem
The import org.apache.nutch.fetcher.FetcherOutputFormat cannot be
resolved    ArcSegmentCreator.java
/nutchbase/src/java/org/apache/nutch/tools/arc    line 44    Java Problem
The import org.apache.nutch.parse.ParseImpl cannot be resolved
ArcSegmentCreator.java    /nutchbase/src/java/org/apache/nutch/tools/arc
line 50    Java Problem
The import org.apache.nutch.parse.ParseImpl cannot be resolved
BasicFields.java    /nutchbase/src/java/org/apache/nutch/indexer/field
line 61    Java Problem
The import org.apache.nutch.parse.ParseImpl cannot be resolved
ExtParser.java
/nutchbase/src/plugin/parse-ext/src/java/org/apache/nutch/parse/ext    line
26    Java Problem
The import org.apache.nutch.parse.ParseImpl cannot be resolved
MSBaseParser.java
/nutchbase/src/plugin/lib-parsems/src/java/org/apache/nutch/parse/ms    line
39    Java Problem
The import org.apache.nutch.parse.ParseImpl cannot be resolved
PdfParser.java
/nutchbase/src/plugin/parse-pdf/src/java/org/apache/nutch/parse/pdf    line
41    Java Problem
The import org.apache.nutch.parse.ParseImpl cannot be resolved
RSSParser.java
/nutchbase/src/plugin/parse-rss/src/java/org/apache/nutch/parse/rss    line
41    Java Problem
The import org.apache.nutch.parse.ParseImpl cannot be resolved
TestExtParser.java
/nutchbase/src/plugin/parse-ext/src/test/org/apache/nutch/parse/ext    line
26    Java Problem
The import org.apache.nutch.parse.ParseImpl cannot be resolved
TestIndexingFilters.java    /nutchbase/src/test/org/apache/nutch/indexer
line 26    Java Problem
The import org.apache.nutch.parse.ParseImpl cannot be resolved
TestMSWordParser.java
/nutchbase/src/plugin/parse-msword/src/test/org/apache/nutch/parse/msword
line 26    Java Problem
The import org.apache.nutch.parse.ParseImpl cannot be resolved
TestMoreIndexingFilter.java
/nutchbase/src/plugin/index-more/src/test/org/apache/nutch/indexer/more
line 29    Java Problem
The import org.apache.nutch.parse.ParseImpl cannot be resolved
TestZipParser.java
/nutchbase/src/plugin/parse-zip/src/test/org/apache/nutch/parse/zip    line
26    Java Problem
The import org.apache.nutch.parse.ParseImpl cannot be resolved
ZipParser.java
/nutchbase/src/plugin/parse-zip/src/java/org/apache/nutch/parse/zip    line
33    Java Problem
The import org.apache.nutch.parse.ParseImpl cannot be resolved
ZipTextExtractor.java
/nutchbase/src/plugin/parse-zip/src/java/org/apache/nutch/parse/zip    line
41    Java Problem
The import org.apache.nutch.parse.ParseResult cannot be resolved
ArcSegmentCreator.java    /nutchbase/src/java/org/apache/nutch/tools/arc
line 51    Java Problem
The import org.apache.nutch.parse.ParseResult cannot be resolved
ExtParser.java
/nutchbase/src/plugin/parse-ext/src/java/org/apache/nutch/parse/ext    line
21    Java Problem
The import org.apache.nutch.parse.ParseResult cannot be resolved
FeedParser.java
/nutchbase/src/plugin/feed/src/java/org/apache/nutch/parse/feed    line
43    Java Problem
The import org.apache.nutch.parse.ParseResult cannot be resolved
HTMLLanguageParser.java
/nutchbase/src/plugin/languageidentifier/src/java/org/apache/nutch/analysis/lang
line 33    Java Problem
The import org.apache.nutch.parse.ParseResult cannot be resolved
MSBaseParser.java
/nutchbase/src/plugin/lib-parsems/src/java/org/apache/nutch/parse/ms    line
40    Java Problem
The import org.apache.nutch.parse.ParseResult cannot be resolved
MSExcelParser.java
/nutchbase/src/plugin/parse-msexcel/src/java/org/apache/nutch/parse/msexcel
line 20    Java Problem
The import org.apache.nutch.parse.ParseResult cannot be resolved
MSPowerPointParser.java
/nutchbase/src/plugin/parse-mspowerpoint/src/java/org/apache/nutch/parse/mspowerpoint
line 20    Java Problem
The import org.apache.nutch.parse.ParseResult cannot be resolved
MSWordParser.java
/nutchbase/src/plugin/parse-msword/src/java/org/apache/nutch/parse/msword
line 21    Java Problem
The import org.apache.nutch.parse.ParseResult cannot be resolved
PdfParser.java
/nutchbase/src/plugin/parse-pdf/src/java/org/apache/nutch/parse/pdf    line
37    Java Problem
The import org.apache.nutch.parse.ParseResult cannot be resolved
RSSParser.java
/nutchbase/src/plugin/parse-rss/src/java/org/apache/nutch/parse/rss    line
36    Java Problem
The import org.apache.nutch.parse.ParseResult cannot be resolved
RelTagParser.java
/nutchbase/src/plugin/microformats-reltag/src/java/org/apache/nutch/microformats/reltag
line 38    Java Problem
The import org.apache.nutch.parse.ParseResult cannot be resolved
TestFeedParser.java
/nutchbase/src/plugin/feed/src/test/org/apache/nutch/parse/feed    line
32    Java Problem
The import org.apache.nutch.parse.ParseResult cannot be resolved
ZipParser.java
/nutchbase/src/plugin/parse-zip/src/java/org/apache/nutch/parse/zip    line
34    Java Problem
The method calculate(WebTableRow, Parse) in the type Signature is not
applicable for the arguments (Content, Parse)    ArcSegmentCreator.java
/nutchbase/src/java/org/apache/nutch/tools/arc    line 187    Java Problem
The method calculate(WebTableRow, Parse) in the type Signature is not
applicable for the arguments (Content, Parse)    ArcSegmentCreator.java
/nutchbase/src/java/org/apache/nutch/tools/arc    line 208    Java Problem
The method fetch(String, int, boolean) from the type Fetcher is not
visible    TestFetcher.java
/nutchbase/src/test/org/apache/nutch/fetcher    line 178    Java Problem
The method fetch(String, int, boolean) in the type Fetcher is not applicable
for the arguments (Path, int, boolean)    TestFetcher.java
/nutchbase/src/test/org/apache/nutch/fetcher    line 101    Java Problem
The method generate(String, long, long, boolean) in the type Generator is
not applicable for the arguments (Path, Path, int, int, long, boolean,
boolean)    TestGenerator.java
/nutchbase/src/test/org/apache/nutch/crawl    line 313    Java Problem
The method generate(String, long, long, boolean) in the type Generator is
not applicable for the arguments (Path, Path, int, long, long, boolean,
boolean)    TestFetcher.java
/nutchbase/src/test/org/apache/nutch/fetcher    line 95    Java Problem
The method getData() is undefined for the type Parse
ArcSegmentCreator.java    /nutchbase/src/java/org/apache/nutch/tools/arc
line 200    Java Problem
The method getData() is undefined for the type Parse
ArcSegmentCreator.java    /nutchbase/src/java/org/apache/nutch/tools/arc
line 211    Java Problem
The method getData() is undefined for the type Parse
ArcSegmentCreator.java    /nutchbase/src/java/org/apache/nutch/tools/arc
line 213    Java Problem
The method getData() is undefined for the type Parse
ArcSegmentCreator.java    /nutchbase/src/java/org/apache/nutch/tools/arc
line 216    Java Problem
The method getData() is undefined for the type Parse
ArcSegmentCreator.java    /nutchbase/src/java/org/apache/nutch/tools/arc
line 230    Java Problem
The method getData() is undefined for the type Parse
ArcSegmentCreator.java    /nutchbase/src/java/org/apache/nutch/tools/arc
line 244    Java Problem
The method getData() is undefined for the type Parse    BasicFields.java
/nutchbase/src/java/org/apache/nutch/indexer/field    line 386    Java
Problem
The method getData() is undefined for the type Parse    BasicFields.java
/nutchbase/src/java/org/apache/nutch/indexer/field    line 395    Java
Problem
The method getData() is undefined for the type Parse
CCIndexingFilter.java
/nutchbase/src/plugin/creativecommons/src/java/org/creativecommons/nutch
line 55    Java Problem
The method getData() is undefined for the type Parse
CCParseFilter.java
/nutchbase/src/plugin/creativecommons/src/java/org/creativecommons/nutch
line 280    Java Problem
The method getData() is undefined for the type Parse
CCParseFilter.java
/nutchbase/src/plugin/creativecommons/src/java/org/creativecommons/nutch
line 286    Java Problem
The method getData() is undefined for the type Parse
CCParseFilter.java
/nutchbase/src/plugin/creativecommons/src/java/org/creativecommons/nutch
line 291    Java Problem
The method getData() is undefined for the type Parse
FeedIndexingFilter.java
/nutchbase/src/plugin/feed/src/java/org/apache/nutch/indexer/feed    line
76    Java Problem

Re: State of nutchbase

Posted by xiao yang <ya...@gmail.com>.
So all components, such as Injector, Generator, Fetcher and Indexer, will
read the table name from this mapping file?
Then the commands will be different from the current version.

2009/12/8 Doğacan Güney <do...@gmail.com>


Re: State of nutchbase

Posted by Doğacan Güney <do...@gmail.com>.
Hey everyone,

So I restarted the nutchbase effort by adding an abstraction over the hbase
api. The idea is to use an intermediate nutch api (which then talks to
hbase) instead of communicating with hbase directly. This allows us a) to
not be completely tied to hbase, making a move to another db in the future
easier, and b) perhaps to support multiple databases right away, with easy
data migration between them.

What I have is very very (VERY) early and extremely alpha, but I am quite
happy with the overall idea, so I am sharing it for suggestions and reviews.
Again, instead of using hbase directly, nutch will use a nice java bean with
getters and setters. Nutch will then figure out what to read from and write
to hbase.

I decided to use avro because it has a very clean design. Here is a very
basic avro definition of WebTableRow:
{"namespace": "org.apache.nutch.storage",
 "protocol": "Web",

 "types": [
     {"name": "WebTableRow", "type": "record",
      "fields": [
          {"name": "rowKey", "type": "string"},
          {"name": "fetchTime", "type": "long"},
          {"name": "title", "type": "string"},
          {"name": "text", "type": "string"},
          {"name": "status", "type": "int"}
      ]
     }
 ]
}

(ignore "protocol". I haven't yet figured out how to compile schemas without
protocols)

I have copied and modified avro's SpecificCompiler to generate a java class.
It is mostly the same as avro's SpecificCompiler, except that the generated
variables are all private and are accessed through getters and setters. Here
is a portion of the generated file:

public class WebTableRow extends NutchTableRow<Utf8> implements SpecificRecord {
  @RowKey // these are used for reflection
  private Utf8 rowKey;
  @RowField
  private long fetchTime;
  @RowField
  private Utf8 title;
  @RowField
  private Utf8 text;
  @RowField
  private int status;
  public Utf8 getRowKey() { .... }
  public void setRowKey(Utf8 value) {....}
  public long getFetchTime() { .... }
  public void setFetchTime(long value) { .... }
  .....

Note that NutchTableRow extends SpecificRecordBase, so this is a proper avro
record. In the future, once hadoop MR supports avro as a serialization
format, NutchTableRow-s can easily be output from maps and reduces, which
is a nice bonus.

We need to force the use of setters instead of direct access to variables,
because one of the nice things about hbase is that you only update the
columns that you changed. However, to know which fields were updated (and
thus map them to hbase columns), we must keep track of what changed.
Currently, NutchTableRow keeps a BitSet covering all fields, and every setter
updates this BitSet, so we know exactly what changed.
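
The BitSet bookkeeping described above can be sketched roughly as follows.
This is only an illustration, not the code from the branch: the class names
(TrackedRow, Row), the field index constants and the use of String instead
of Utf8 are all assumptions made to keep the example self-contained.

```java
import java.util.BitSet;

// Hypothetical sketch of setter-based change tracking; not the nutchbase code.
public class DirtyTrackingSketch {

    /** Base class that remembers which field indexes were modified. */
    static abstract class TrackedRow {
        private final BitSet dirty = new BitSet();

        protected void markDirty(int fieldIndex) { dirty.set(fieldIndex); }

        public boolean isDirty(int fieldIndex) { return dirty.get(fieldIndex); }

        /** Returns a copy so callers cannot alter the internal state. */
        public BitSet dirtyFields() { return (BitSet) dirty.clone(); }

        /** Would be called after a successful write to the store. */
        public void clearDirty() { dirty.clear(); }
    }

    /** Stand-in for a generated row class; the field indexes are assumed. */
    static class Row extends TrackedRow {
        static final int FETCH_TIME = 0;
        static final int TITLE = 1;
        static final int STATUS = 2;

        private long fetchTime;
        private String title;
        private int status;

        public long getFetchTime() { return fetchTime; }
        public void setFetchTime(long value) { fetchTime = value; markDirty(FETCH_TIME); }

        public String getTitle() { return title; }
        public void setTitle(String value) { title = value; markDirty(TITLE); }

        public int getStatus() { return status; }
        public void setStatus(int value) { status = value; markDirty(STATUS); }
    }

    public static void main(String[] args) {
        Row row = new Row();
        row.setTitle("Apache Nutch");
        row.setStatus(2);
        // Only the title and status columns would need to be written.
        System.out.println(row.dirtyFields()); // prints {1, 2}
    }
}
```

A serializer would then walk the set bits, write only those columns, and
call clearDirty() once the write succeeds.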

There is also a new interface called NutchSerializer that defines readRow
and writeRow methods (it also needs scans, row deletes, etc., but that's for
later). Currently HbaseSerializer implements NutchSerializer and reads and
writes WebTableRow-s. HbaseSerializer currently works via reflection. It
should be easy to add code generation to our SpecificCompiler so that we can
also output a WebTableRowHbaseSerializer along with WebTableRow instead of
using reflection.
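
To make the reflection idea concrete, here is a rough sketch of a serializer
that discovers annotated fields at runtime. Everything in it is an
assumption for illustration: the method signatures of NutchSerializer, the
annotation definitions, the WebRow class and the in-memory map that stands
in for hbase are not taken from the branch.

```java
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.reflect.Constructor;
import java.lang.reflect.Field;
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch; the real interface and annotations may look different.
public class ReflectionSerializerSketch {

    @Retention(RetentionPolicy.RUNTIME) @interface RowKey {}
    @Retention(RetentionPolicy.RUNTIME) @interface RowField {}

    /** Assumed shape of the serializer interface. */
    interface NutchSerializer<R> {
        void writeRow(R row);
        R readRow(String key);
    }

    /** Simplified row bean using String instead of avro's Utf8. */
    static class WebRow {
        @RowKey private String rowKey;
        @RowField private long fetchTime;
        @RowField private String title;

        public String getRowKey() { return rowKey; }
        public void setRowKey(String value) { rowKey = value; }
        public long getFetchTime() { return fetchTime; }
        public void setFetchTime(long value) { fetchTime = value; }
        public String getTitle() { return title; }
        public void setTitle(String value) { title = value; }
    }

    /** Keeps rows in a Map instead of hbase, purely for illustration. */
    static class InMemorySerializer<R> implements NutchSerializer<R> {
        private final Class<R> type;
        private final Map<String, Map<String, Object>> table = new HashMap<>();

        InMemorySerializer(Class<R> type) { this.type = type; }

        @Override
        public void writeRow(R row) {
            try {
                String key = null;
                Map<String, Object> columns = new TreeMap<>();
                for (Field f : type.getDeclaredFields()) {
                    f.setAccessible(true);
                    if (f.isAnnotationPresent(RowKey.class)) {
                        Object value = f.get(row);
                        key = (value == null) ? null : value.toString();
                    } else if (f.isAnnotationPresent(RowField.class)) {
                        columns.put(f.getName(), f.get(row));
                    }
                }
                if (key == null) throw new IllegalArgumentException("row has no key");
                table.put(key, columns);
            } catch (IllegalAccessException e) {
                throw new IllegalStateException(e);
            }
        }

        @Override
        public R readRow(String key) {
            Map<String, Object> columns = table.get(key);
            if (columns == null) return null; // unknown row
            try {
                Constructor<R> ctor = type.getDeclaredConstructor();
                ctor.setAccessible(true);
                R row = ctor.newInstance();
                for (Field f : type.getDeclaredFields()) {
                    f.setAccessible(true);
                    if (f.isAnnotationPresent(RowKey.class)) {
                        f.set(row, key);
                    } else if (f.isAnnotationPresent(RowField.class)) {
                        f.set(row, columns.get(f.getName()));
                    }
                }
                return row;
            } catch (ReflectiveOperationException e) {
                throw new IllegalStateException(e);
            }
        }
    }

    public static void main(String[] args) {
        InMemorySerializer<WebRow> store = new InMemorySerializer<>(WebRow.class);
        WebRow row = new WebRow();
        row.setRowKey("org.apache.nutch:http/");
        row.setTitle("Apache Nutch");
        row.setFetchTime(1260000000000L);
        store.writeRow(row);
        System.out.println(store.readRow("org.apache.nutch:http/").getTitle());
    }
}
```

A generated per-class serializer would replace the two reflection loops with
direct getter and setter calls, which is the trade-off mentioned above.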

What I have currently can read and write primitive types and strings to and
from hbase. You can check it out from github.com/dogacan/nutchbase (branch
master, package o.a.n.storage). Again, I would like to note that the code is
very very alpha and not in good shape, but it should be a good starting
point if you are interested.

Once hbase support is solid, I intend to add support for other databases
(bdb, cassandra and sql come to mind). If I got everything right, moving
data from one database to another should be an incredibly trivial task. So
you can start with, say, bdb, then switch over to hbase once your data gets
large.

Oh, I forgot... HbaseSerializer reads an hbase-mapping.xml file that
describes the mapping between fields and hbase columns:

<table name="webtable" class="org.apache.nutch.storage.WebTableRow">
  <description>
    <!-- Families can also have params like compression, bloom filters -->
    <family name="p"/>
    <family name="f"/>
  </description>
  <fields>
    <field name="fetchTime" family="f" qualifier="ts"/>
    <field name="title" family="p" qualifier="t"/>
    <field name="text" family="p" qualifier="c"/>
    <field name="status" family="f" qualifier="st"/>
  </fields>
</table>
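
For anyone curious how such a mapping could be consumed, here is a minimal
sketch that loads the field-to-column pairs with the JDK's DOM parser. It
assumes the file is well-formed XML (with a closing table tag); the class
and method names (MappingReaderSketch, readColumns) are made up for the
example and are not what HbaseSerializer actually does.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical sketch of reading a mapping file; names are illustrative only.
public class MappingReaderSketch {

    /** Returns field name -> "family:qualifier" for every field element. */
    static Map<String, String> readColumns(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            NodeList fields = doc.getElementsByTagName("field");
            Map<String, String> columns = new LinkedHashMap<>();
            for (int i = 0; i < fields.getLength(); i++) {
                Element field = (Element) fields.item(i);
                columns.put(field.getAttribute("name"),
                    field.getAttribute("family") + ":" + field.getAttribute("qualifier"));
            }
            return columns;
        } catch (Exception e) {
            throw new IllegalArgumentException("invalid mapping file", e);
        }
    }

    public static void main(String[] args) {
        String xml =
            "<table name=\"webtable\" class=\"org.apache.nutch.storage.WebTableRow\">"
            + "<fields>"
            + "<field name=\"fetchTime\" family=\"f\" qualifier=\"ts\"/>"
            + "<field name=\"title\" family=\"p\" qualifier=\"t\"/>"
            + "</fields>"
            + "</table>";
        System.out.println(readColumns(xml)); // prints {fetchTime=f:ts, title=p:t}
    }
}
```

Combined with change tracking on the row, a serializer would only need to
look up the columns of the fields that were actually modified.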

Sorry for the long and rambling email. Feel free to ask if anything is
unclear (and I assume it must be, given my incoherent description :)
-- 
Doğacan Güney

Re: State of nutchbase

Posted by Andrzej Bialecki <ab...@getopt.org>.
Alban Mouton wrote:
> Hello,
> 
> I have looked a little into the Nutch code and mailing lists. I think the
> nutchbase branch (http://issues.apache.org/jira/browse/NUTCH-650) is
> very interesting, with good potential to improve code clarity and
> flexibility (I find the data structures quite obscure in the current
> version). The issue has been untouched since last August, so my question
> is: can nutchbase really be part of Nutch 1.1?

Definitely not. Release 1.1 will be an update to 1.0, with no major
design changes. However, we intend to integrate the nutchbase branch
with trunk at some point, but since this would be a major change it
would come under a 2.0 branch or so.


> Is there still much work to do, or is it almost ready? Is it a worthwhile
> issue for an interested developer with a (still!) limited knowledge of
> the project?

Please contact Dogacan, who is leading the work on this branch. AFAIK 
he's going to update the design soon.

> 
> So far I have only tried to run nutchbase in Eclipse by following the
> tutorial (http://wiki.apache.org/nutch/RunNutchInEclipse1.0), but I run
> into errors when building, mostly from Parser and tests. I may start by
> cleaning this up.

See above - please coordinate with Dogacan to avoid duplication of effort.

-- 
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com