Posted to dev@nutch.apache.org by Alban Mouton <al...@gmail.com> on 2009/12/05 15:56:49 UTC
State of nutchbase
Hello,
I have looked a little into the Nutch code and mailing lists. I think the
nutchbase branch (http://issues.apache.org/jira/browse/NUTCH-650) is very
interesting, with good potential to improve code clarity and flexibility
(I find the data structures quite obscure in the current version). The issue
has been untouched since last August, so my question is: can nutchbase really
be part of Nutch 1.1? Is there still much work to do, or is it almost ready?
Is it a worthwhile issue for an interested developer with a (still!) limited
knowledge of the project?
So far I have only tried to run nutchbase in Eclipse by following the
tutorial (http://wiki.apache.org/nutch/RunNutchInEclipse1.0), but I run into
errors when building, mostly in the parsers and tests. I may start by cleaning
this up.
Eclipse build errors:
Description Resource Path Location Type
FetcherOutputFormat cannot be resolved to a type
ArcSegmentCreator.java /nutchbase/src/java/org/apache/nutch/tools/arc
line 362 Java Problem
Generator.GENERATE_MAX_PER_HOST_BY_IP cannot be resolved
TestGenerator.java /nutchbase/src/test/org/apache/nutch/crawl line
202 Java Problem
ParseImpl cannot be resolved to a type ArcSegmentCreator.java
/nutchbase/src/java/org/apache/nutch/tools/arc line 229 Java Problem
ParseImpl cannot be resolved to a type BasicFields.java
/nutchbase/src/java/org/apache/nutch/indexer/field line 335 Java
Problem
ParseImpl cannot be resolved to a type ExtParser.java
/nutchbase/src/plugin/parse-ext/src/java/org/apache/nutch/parse/ext line
138 Java Problem
ParseImpl cannot be resolved to a type MSBaseParser.java
/nutchbase/src/plugin/lib-parsems/src/java/org/apache/nutch/parse/ms line
108 Java Problem
ParseImpl cannot be resolved to a type OOParser.java
/nutchbase/src/plugin/parse-oo/src/java/org/apache/nutch/parse/oo line
103 Java Problem
ParseImpl cannot be resolved to a type PdfParser.java
/nutchbase/src/plugin/parse-pdf/src/java/org/apache/nutch/parse/pdf line
155 Java Problem
ParseImpl cannot be resolved to a type RSSParser.java
/nutchbase/src/plugin/parse-rss/src/java/org/apache/nutch/parse/rss line
187 Java Problem
ParseImpl cannot be resolved to a type SWFParser.java
/nutchbase/src/plugin/parse-swf/src/java/org/apache/nutch/parse/swf line
113 Java Problem
ParseImpl cannot be resolved to a type TestIndexingFilters.java
/nutchbase/src/test/org/apache/nutch/indexer line 45 Java Problem
ParseImpl cannot be resolved to a type TestMoreIndexingFilter.java
/nutchbase/src/plugin/index-more/src/test/org/apache/nutch/indexer/more
line 61 Java Problem
ParseImpl cannot be resolved to a type TextParser.java
/nutchbase/src/plugin/parse-text/src/java/org/apache/nutch/parse/text
line 55 Java Problem
ParseImpl cannot be resolved to a type ZipParser.java
/nutchbase/src/plugin/parse-zip/src/java/org/apache/nutch/parse/zip line
105 Java Problem
ParseResult cannot be resolved ExtParser.java
/nutchbase/src/plugin/parse-ext/src/java/org/apache/nutch/parse/ext line
137 Java Problem
ParseResult cannot be resolved MSBaseParser.java
/nutchbase/src/plugin/lib-parsems/src/java/org/apache/nutch/parse/ms line
107 Java Problem
ParseResult cannot be resolved OOParser.java
/nutchbase/src/plugin/parse-oo/src/java/org/apache/nutch/parse/oo line
103 Java Problem
ParseResult cannot be resolved PdfParser.java
/nutchbase/src/plugin/parse-pdf/src/java/org/apache/nutch/parse/pdf line
155 Java Problem
ParseResult cannot be resolved RSSParser.java
/nutchbase/src/plugin/parse-rss/src/java/org/apache/nutch/parse/rss line
187 Java Problem
ParseResult cannot be resolved SWFParser.java
/nutchbase/src/plugin/parse-swf/src/java/org/apache/nutch/parse/swf line
113 Java Problem
ParseResult cannot be resolved TextParser.java
/nutchbase/src/plugin/parse-text/src/java/org/apache/nutch/parse/text
line 55 Java Problem
ParseResult cannot be resolved ZipParser.java
/nutchbase/src/plugin/parse-zip/src/java/org/apache/nutch/parse/zip line
105 Java Problem
ParseResult cannot be resolved to a type ArcSegmentCreator.java
/nutchbase/src/java/org/apache/nutch/tools/arc line 159 Java Problem
ParseResult cannot be resolved to a type CCParseFilter.java
/nutchbase/src/plugin/creativecommons/src/java/org/creativecommons/nutch
line 267 Java Problem
ParseResult cannot be resolved to a type CCParseFilter.java
/nutchbase/src/plugin/creativecommons/src/java/org/creativecommons/nutch
line 267 Java Problem
ParseResult cannot be resolved to a type ExtParser.java
/nutchbase/src/plugin/parse-ext/src/java/org/apache/nutch/parse/ext line
69 Java Problem
ParseResult cannot be resolved to a type FeedParser.java
/nutchbase/src/plugin/feed/src/java/org/apache/nutch/parse/feed line
106 Java Problem
ParseResult cannot be resolved to a type FeedParser.java
/nutchbase/src/plugin/feed/src/java/org/apache/nutch/parse/feed line
108 Java Problem
ParseResult cannot be resolved to a type FeedParser.java
/nutchbase/src/plugin/feed/src/java/org/apache/nutch/parse/feed line
108 Java Problem
ParseResult cannot be resolved to a type FeedParser.java
/nutchbase/src/plugin/feed/src/java/org/apache/nutch/parse/feed line
211 Java Problem
ParseResult cannot be resolved to a type FeedParser.java
/nutchbase/src/plugin/feed/src/java/org/apache/nutch/parse/feed line
221 Java Problem
ParseResult cannot be resolved to a type HTMLLanguageParser.java
/nutchbase/src/plugin/languageidentifier/src/java/org/apache/nutch/analysis/lang
line 90 Java Problem
ParseResult cannot be resolved to a type HTMLLanguageParser.java
/nutchbase/src/plugin/languageidentifier/src/java/org/apache/nutch/analysis/lang
line 90 Java Problem
ParseResult cannot be resolved to a type MSBaseParser.java
/nutchbase/src/plugin/lib-parsems/src/java/org/apache/nutch/parse/ms line
64 Java Problem
ParseResult cannot be resolved to a type MSExcelParser.java
/nutchbase/src/plugin/parse-msexcel/src/java/org/apache/nutch/parse/msexcel
line 40 Java Problem
ParseResult cannot be resolved to a type MSPowerPointParser.java
/nutchbase/src/plugin/parse-mspowerpoint/src/java/org/apache/nutch/parse/mspowerpoint
line 44 Java Problem
ParseResult cannot be resolved to a type MSWordParser.java
/nutchbase/src/plugin/parse-msword/src/java/org/apache/nutch/parse/msword
line 43 Java Problem
ParseResult cannot be resolved to a type OOParser.java
/nutchbase/src/plugin/parse-oo/src/java/org/apache/nutch/parse/oo line
63 Java Problem
ParseResult cannot be resolved to a type PdfParser.java
/nutchbase/src/plugin/parse-pdf/src/java/org/apache/nutch/parse/pdf line
69 Java Problem
ParseResult cannot be resolved to a type RSSParser.java
/nutchbase/src/plugin/parse-rss/src/java/org/apache/nutch/parse/rss line
80 Java Problem
ParseResult cannot be resolved to a type RelTagParser.java
/nutchbase/src/plugin/microformats-reltag/src/java/org/apache/nutch/microformats/reltag
line 68 Java Problem
ParseResult cannot be resolved to a type RelTagParser.java
/nutchbase/src/plugin/microformats-reltag/src/java/org/apache/nutch/microformats/reltag
line 68 Java Problem
ParseResult cannot be resolved to a type SWFParser.java
/nutchbase/src/plugin/parse-swf/src/java/org/apache/nutch/parse/swf line
64 Java Problem
ParseResult cannot be resolved to a type SWFParser.java
/nutchbase/src/plugin/parse-swf/src/java/org/apache/nutch/parse/swf line
125 Java Problem
ParseResult cannot be resolved to a type TestFeedParser.java
/nutchbase/src/plugin/feed/src/test/org/apache/nutch/parse/feed line
94 Java Problem
ParseResult cannot be resolved to a type TextParser.java
/nutchbase/src/plugin/parse-text/src/java/org/apache/nutch/parse/text
line 41 Java Problem
ParseResult cannot be resolved to a type ZipParser.java
/nutchbase/src/plugin/parse-zip/src/java/org/apache/nutch/parse/zip line
55 Java Problem
The constructor Fetcher(Configuration) is undefined TestFetcher.java
/nutchbase/src/test/org/apache/nutch/fetcher line 100 Java Problem
The constructor Fetcher(Configuration) is undefined TestFetcher.java
/nutchbase/src/test/org/apache/nutch/fetcher line 177 Java Problem
The constructor Generator(Configuration) is undefined TestFetcher.java
/nutchbase/src/test/org/apache/nutch/fetcher line 94 Java Problem
The constructor Generator(Configuration) is undefined
TestGenerator.java /nutchbase/src/test/org/apache/nutch/crawl line
312 Java Problem
The constructor Injector(Configuration) is undefined TestFetcher.java
/nutchbase/src/test/org/apache/nutch/fetcher line 90 Java Problem
The constructor Injector(Configuration) is undefined TestInjector.java
/nutchbase/src/test/org/apache/nutch/crawl line 70 Java Problem
The constructor NutchWritable(ParseImpl) is undefined
ArcSegmentCreator.java /nutchbase/src/java/org/apache/nutch/tools/arc
line 229 Java Problem
The import org.apache.nutch.fetcher.FetcherOutputFormat cannot be
resolved ArcSegmentCreator.java
/nutchbase/src/java/org/apache/nutch/tools/arc line 44 Java Problem
The import org.apache.nutch.parse.ParseImpl cannot be resolved
ArcSegmentCreator.java /nutchbase/src/java/org/apache/nutch/tools/arc
line 50 Java Problem
The import org.apache.nutch.parse.ParseImpl cannot be resolved
BasicFields.java /nutchbase/src/java/org/apache/nutch/indexer/field
line 61 Java Problem
The import org.apache.nutch.parse.ParseImpl cannot be resolved
ExtParser.java
/nutchbase/src/plugin/parse-ext/src/java/org/apache/nutch/parse/ext line
26 Java Problem
The import org.apache.nutch.parse.ParseImpl cannot be resolved
MSBaseParser.java
/nutchbase/src/plugin/lib-parsems/src/java/org/apache/nutch/parse/ms line
39 Java Problem
The import org.apache.nutch.parse.ParseImpl cannot be resolved
PdfParser.java
/nutchbase/src/plugin/parse-pdf/src/java/org/apache/nutch/parse/pdf line
41 Java Problem
The import org.apache.nutch.parse.ParseImpl cannot be resolved
RSSParser.java
/nutchbase/src/plugin/parse-rss/src/java/org/apache/nutch/parse/rss line
41 Java Problem
The import org.apache.nutch.parse.ParseImpl cannot be resolved
TestExtParser.java
/nutchbase/src/plugin/parse-ext/src/test/org/apache/nutch/parse/ext line
26 Java Problem
The import org.apache.nutch.parse.ParseImpl cannot be resolved
TestIndexingFilters.java /nutchbase/src/test/org/apache/nutch/indexer
line 26 Java Problem
The import org.apache.nutch.parse.ParseImpl cannot be resolved
TestMSWordParser.java
/nutchbase/src/plugin/parse-msword/src/test/org/apache/nutch/parse/msword
line 26 Java Problem
The import org.apache.nutch.parse.ParseImpl cannot be resolved
TestMoreIndexingFilter.java
/nutchbase/src/plugin/index-more/src/test/org/apache/nutch/indexer/more
line 29 Java Problem
The import org.apache.nutch.parse.ParseImpl cannot be resolved
TestZipParser.java
/nutchbase/src/plugin/parse-zip/src/test/org/apache/nutch/parse/zip line
26 Java Problem
The import org.apache.nutch.parse.ParseImpl cannot be resolved
ZipParser.java
/nutchbase/src/plugin/parse-zip/src/java/org/apache/nutch/parse/zip line
33 Java Problem
The import org.apache.nutch.parse.ParseImpl cannot be resolved
ZipTextExtractor.java
/nutchbase/src/plugin/parse-zip/src/java/org/apache/nutch/parse/zip line
41 Java Problem
The import org.apache.nutch.parse.ParseResult cannot be resolved
ArcSegmentCreator.java /nutchbase/src/java/org/apache/nutch/tools/arc
line 51 Java Problem
The import org.apache.nutch.parse.ParseResult cannot be resolved
ExtParser.java
/nutchbase/src/plugin/parse-ext/src/java/org/apache/nutch/parse/ext line
21 Java Problem
The import org.apache.nutch.parse.ParseResult cannot be resolved
FeedParser.java
/nutchbase/src/plugin/feed/src/java/org/apache/nutch/parse/feed line
43 Java Problem
The import org.apache.nutch.parse.ParseResult cannot be resolved
HTMLLanguageParser.java
/nutchbase/src/plugin/languageidentifier/src/java/org/apache/nutch/analysis/lang
line 33 Java Problem
The import org.apache.nutch.parse.ParseResult cannot be resolved
MSBaseParser.java
/nutchbase/src/plugin/lib-parsems/src/java/org/apache/nutch/parse/ms line
40 Java Problem
The import org.apache.nutch.parse.ParseResult cannot be resolved
MSExcelParser.java
/nutchbase/src/plugin/parse-msexcel/src/java/org/apache/nutch/parse/msexcel
line 20 Java Problem
The import org.apache.nutch.parse.ParseResult cannot be resolved
MSPowerPointParser.java
/nutchbase/src/plugin/parse-mspowerpoint/src/java/org/apache/nutch/parse/mspowerpoint
line 20 Java Problem
The import org.apache.nutch.parse.ParseResult cannot be resolved
MSWordParser.java
/nutchbase/src/plugin/parse-msword/src/java/org/apache/nutch/parse/msword
line 21 Java Problem
The import org.apache.nutch.parse.ParseResult cannot be resolved
PdfParser.java
/nutchbase/src/plugin/parse-pdf/src/java/org/apache/nutch/parse/pdf line
37 Java Problem
The import org.apache.nutch.parse.ParseResult cannot be resolved
RSSParser.java
/nutchbase/src/plugin/parse-rss/src/java/org/apache/nutch/parse/rss line
36 Java Problem
The import org.apache.nutch.parse.ParseResult cannot be resolved
RelTagParser.java
/nutchbase/src/plugin/microformats-reltag/src/java/org/apache/nutch/microformats/reltag
line 38 Java Problem
The import org.apache.nutch.parse.ParseResult cannot be resolved
TestFeedParser.java
/nutchbase/src/plugin/feed/src/test/org/apache/nutch/parse/feed line
32 Java Problem
The import org.apache.nutch.parse.ParseResult cannot be resolved
ZipParser.java
/nutchbase/src/plugin/parse-zip/src/java/org/apache/nutch/parse/zip line
34 Java Problem
The method calculate(WebTableRow, Parse) in the type Signature is not
applicable for the arguments (Content, Parse) ArcSegmentCreator.java
/nutchbase/src/java/org/apache/nutch/tools/arc line 187 Java Problem
The method calculate(WebTableRow, Parse) in the type Signature is not
applicable for the arguments (Content, Parse) ArcSegmentCreator.java
/nutchbase/src/java/org/apache/nutch/tools/arc line 208 Java Problem
The method fetch(String, int, boolean) from the type Fetcher is not
visible TestFetcher.java
/nutchbase/src/test/org/apache/nutch/fetcher line 178 Java Problem
The method fetch(String, int, boolean) in the type Fetcher is not applicable
for the arguments (Path, int, boolean) TestFetcher.java
/nutchbase/src/test/org/apache/nutch/fetcher line 101 Java Problem
The method generate(String, long, long, boolean) in the type Generator is
not applicable for the arguments (Path, Path, int, int, long, boolean,
boolean) TestGenerator.java
/nutchbase/src/test/org/apache/nutch/crawl line 313 Java Problem
The method generate(String, long, long, boolean) in the type Generator is
not applicable for the arguments (Path, Path, int, long, long, boolean,
boolean) TestFetcher.java
/nutchbase/src/test/org/apache/nutch/fetcher line 95 Java Problem
The method getData() is undefined for the type Parse
ArcSegmentCreator.java /nutchbase/src/java/org/apache/nutch/tools/arc
line 200 Java Problem
The method getData() is undefined for the type Parse
ArcSegmentCreator.java /nutchbase/src/java/org/apache/nutch/tools/arc
line 211 Java Problem
The method getData() is undefined for the type Parse
ArcSegmentCreator.java /nutchbase/src/java/org/apache/nutch/tools/arc
line 213 Java Problem
The method getData() is undefined for the type Parse
ArcSegmentCreator.java /nutchbase/src/java/org/apache/nutch/tools/arc
line 216 Java Problem
The method getData() is undefined for the type Parse
ArcSegmentCreator.java /nutchbase/src/java/org/apache/nutch/tools/arc
line 230 Java Problem
The method getData() is undefined for the type Parse
ArcSegmentCreator.java /nutchbase/src/java/org/apache/nutch/tools/arc
line 244 Java Problem
The method getData() is undefined for the type Parse BasicFields.java
/nutchbase/src/java/org/apache/nutch/indexer/field line 386 Java
Problem
The method getData() is undefined for the type Parse BasicFields.java
/nutchbase/src/java/org/apache/nutch/indexer/field line 395 Java
Problem
The method getData() is undefined for the type Parse
CCIndexingFilter.java
/nutchbase/src/plugin/creativecommons/src/java/org/creativecommons/nutch
line 55 Java Problem
The method getData() is undefined for the type Parse
CCParseFilter.java
/nutchbase/src/plugin/creativecommons/src/java/org/creativecommons/nutch
line 280 Java Problem
The method getData() is undefined for the type Parse
CCParseFilter.java
/nutchbase/src/plugin/creativecommons/src/java/org/creativecommons/nutch
line 286 Java Problem
The method getData() is undefined for the type Parse
CCParseFilter.java
/nutchbase/src/plugin/creativecommons/src/java/org/creativecommons/nutch
line 291 Java Problem
The method getData() is undefined for the type Parse
FeedIndexingFilter.java
/nutchbase/src/plugin/feed/src/java/org/apache/nutch/indexer/feed line
76 Java Problem
Re: State of nutchbase
Posted by xiao yang <ya...@gmail.com>.
So all components, such as Injector, Generator, Fetcher and Indexer, will read
the table name from this mapping file?
The commands will be different from those in the current version.
Re: State of nutchbase
Posted by Doğacan Güney <do...@gmail.com>.
Hey everyone,
So I restarted the nutchbase effort by adding an abstraction over the HBase
API. The idea is to use an intermediate Nutch API (which then talks to
HBase) instead of communicating with HBase directly. This allows us a) to
avoid being completely tied to HBase, making a move to another db in the
future easier, and b) perhaps to support multiple databases right away, with
easy data migration between them.
What I have is very very (VERY) early and extremely alpha, but I am quite
happy with the overall idea, so I am sharing it for suggestions and reviews.
Again, instead of using HBase directly, Nutch will use a nice Java bean with
getters and setters. Nutch will then figure out what to read from and write
to HBase.
I decided to use Avro because it has a very clean design. Here is a very
basic WebTableRow definition:
{"namespace": "org.apache.nutch.storage",
 "protocol": "Web",
 "types": [
   {"name": "WebTableRow", "type": "record",
    "fields": [
      {"name": "rowKey", "type": "string"},
      {"name": "fetchTime", "type": "long"},
      {"name": "title", "type": "string"},
      {"name": "text", "type": "string"},
      {"name": "status", "type": "int"}
    ]
   }
 ]
}
(Ignore "protocol"; I haven't yet figured out how to compile schemas without
protocols.)
I have copied and modified Avro's SpecificCompiler to generate a Java class.
It is mostly the same as Avro's SpecificCompiler, except that the generated
variables are all private and are accessed through getters and setters. Here
is a portion of the generated file:
public class WebTableRow extends NutchTableRow<Utf8> implements
    SpecificRecord {
  @RowKey // these are used for reflection
  private Utf8 rowKey;
  @RowField
  private long fetchTime;
  @RowField
  private Utf8 title;
  @RowField
  private Utf8 text;
  @RowField
  private int status;
  public Utf8 getRowKey() { .... }
  public void setRowKey(Utf8 value) { .... }
  public long getFetchTime() { .... }
  public void setFetchTime(long value) { .... }
  .....
Note that NutchTableRow extends SpecificRecordBase, so this is a proper Avro
record. In the future, once Hadoop MR supports Avro as a serialization
format, NutchTableRow-s can easily be output through maps and reduces, which
is a nice bonus.
We need to force the use of setters instead of direct access to variables,
because one of the nice things about HBase is that you only update the
columns that you changed. However, to know which fields were updated (and
thus map them to HBase columns), we must keep track of what changed.
Currently, NutchTableRow keeps a BitSet covering all fields, and every setter
updates this BitSet, so we know exactly what changed.
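To make the change-tracking idea concrete, here is a minimal sketch. It is not the actual NutchTableRow code: the class names TrackedRow and SimpleWebRow, the field index constants and the clearDirty method are all made up for illustration, and String stands in for Avro's Utf8 so the example has no dependencies.

```java
import java.util.BitSet;

// Hypothetical stand-in for NutchTableRow: remembers which fields were set.
abstract class TrackedRow {
    private final BitSet dirty;

    protected TrackedRow(int fieldCount) {
        this.dirty = new BitSet(fieldCount);
    }

    // Every setter calls this with the index of the field it writes.
    protected void markDirty(int fieldIndex) {
        dirty.set(fieldIndex);
    }

    public boolean isDirty(int fieldIndex) {
        return dirty.get(fieldIndex);
    }

    // Returns a copy so callers cannot alter the tracking state.
    public BitSet dirtyFields() {
        return (BitSet) dirty.clone();
    }

    // A writer would call this after a successful write (assumed behaviour).
    public void clearDirty() {
        dirty.clear();
    }
}

// Hypothetical row with a subset of the WebTableRow fields.
class SimpleWebRow extends TrackedRow {
    static final int FETCH_TIME = 0;
    static final int TITLE = 1;
    static final int TEXT = 2;
    static final int STATUS = 3;

    private long fetchTime;
    private String title;
    private String text;
    private int status;

    SimpleWebRow() {
        super(4);
    }

    public long getFetchTime() { return fetchTime; }
    public void setFetchTime(long value) { fetchTime = value; markDirty(FETCH_TIME); }

    public String getTitle() { return title; }
    public void setTitle(String value) { title = value; markDirty(TITLE); }

    public String getText() { return text; }
    public void setText(String value) { text = value; markDirty(TEXT); }

    public int getStatus() { return status; }
    public void setStatus(int value) { status = value; markDirty(STATUS); }
}

class DirtyTrackingDemo {
    public static void main(String[] args) {
        SimpleWebRow row = new SimpleWebRow();
        row.setTitle("Example");
        row.setStatus(2);
        // Only the title and status bits are set.
        System.out.println(row.dirtyFields()); // prints {1, 3}
    }
}
```

A writer would then only need to walk the set bits and emit one column update per dirty field.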
There is also a new interface called NutchSerializer that defines readRow
and writeRow methods (it also needs scans, row deletes etc., but that's for
later). Currently HbaseSerializer implements NutchSerializer and reads and
writes WebTableRow-s. HbaseSerializer currently works via reflection. It
should be easy to add code generation to our SpecificCompiler so that we can
also output a WebTableRowHbaseSerializer along with WebTableRow instead of
using reflection.
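As a rough illustration of the reflection approach, here is a sketch of a serializer that discovers annotated fields at runtime. Everything in it is an assumption made for the example: the RowSerializer interface and its method signatures only mimic the readRow/writeRow idea, the annotations are re-declared locally, PageRow is a made-up row class, and a HashMap stands in for the HBase table.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.Constructor;
import java.lang.reflect.Field;
import java.util.HashMap;
import java.util.Map;

// Local re-declarations of the markers; they must be retained at runtime.
@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.FIELD)
@interface RowKey {}

@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.FIELD)
@interface RowField {}

// Hypothetical row class; String replaces Utf8 to avoid dependencies.
class PageRow {
    @RowKey private String rowKey;
    @RowField private long fetchTime;
    @RowField private String title;

    PageRow() {}
    PageRow(String rowKey) { this.rowKey = rowKey; }

    public String getRowKey() { return rowKey; }
    public long getFetchTime() { return fetchTime; }
    public void setFetchTime(long value) { fetchTime = value; }
    public String getTitle() { return title; }
    public void setTitle(String value) { title = value; }
}

// Assumed shape of a NutchSerializer-like interface (signatures are guesses).
interface RowSerializer<R> {
    void writeRow(R row);
    R readRow(String key);
}

// Stores rows as key -> (field name -> value) in memory instead of HBase.
class InMemoryReflectiveSerializer<R> implements RowSerializer<R> {
    private final Class<R> type;
    private final Map<String, Map<String, Object>> table = new HashMap<>();

    InMemoryReflectiveSerializer(Class<R> type) { this.type = type; }

    @Override
    public void writeRow(R row) {
        try {
            Object key = null;
            Map<String, Object> columns = new HashMap<>();
            for (Field f : type.getDeclaredFields()) {
                f.setAccessible(true);
                if (f.isAnnotationPresent(RowKey.class)) {
                    key = f.get(row);
                } else if (f.isAnnotationPresent(RowField.class)) {
                    columns.put(f.getName(), f.get(row));
                }
            }
            if (key == null) {
                throw new IllegalArgumentException("row has no @RowKey value");
            }
            table.put(key.toString(), columns);
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException(e);
        }
    }

    @Override
    public R readRow(String key) {
        Map<String, Object> columns = table.get(key);
        if (columns == null) {
            return null; // no such row
        }
        try {
            Constructor<R> ctor = type.getDeclaredConstructor();
            ctor.setAccessible(true);
            R row = ctor.newInstance();
            for (Field f : type.getDeclaredFields()) {
                f.setAccessible(true);
                if (f.isAnnotationPresent(RowKey.class)) {
                    f.set(row, key);
                } else if (f.isAnnotationPresent(RowField.class)) {
                    f.set(row, columns.get(f.getName()));
                }
            }
            return row;
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException(e);
        }
    }
}

class ReflectiveSerializerDemo {
    public static void main(String[] args) {
        RowSerializer<PageRow> store = new InMemoryReflectiveSerializer<>(PageRow.class);
        PageRow row = new PageRow("http://example.com/");
        row.setTitle("Example");
        row.setFetchTime(42L);
        store.writeRow(row);
        PageRow copy = store.readRow("http://example.com/");
        System.out.println(copy.getTitle() + " @ " + copy.getFetchTime());
    }
}
```

A generated per-class serializer would replace the two reflective loops with plain getter and setter calls, which is the motivation for adding code generation later.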
What I have currently can read and write primitive types and strings to and
from HBase. You can check it out from github.com/dogacan/nutchbase (branch
master, package o.a.n.storage). Again, I would like to note that the code is
very very alpha and is not in good shape, but it should be a good starting
point if you are interested.
Once HBase support is solid, I intend to add support for other databases
(bdb, cassandra and sql come to mind). If I got everything right, then
moving data from one database to another becomes an incredibly trivial task.
So you can start with, say, bdb and then switch over to HBase once your data
gets large.
Oh I forgot... HbaseSerializer reads an hbase-mapping.xml file that describes
the mapping between fields and HBase columns:
<table name="webtable" class="org.apache.nutch.storage.WebTableRow">
  <description>
    <family name="p"/> <!-- This can also have params like compression,
                            bloom filters -->
    <family name="f"/>
  </description>
  <fields>
    <field name="fetchTime" family="f" qualifier="ts"/>
    <field name="title" family="p" qualifier="t"/>
    <field name="text" family="p" qualifier="c"/>
    <field name="status" family="f" qualifier="st"/>
  </fields>
</table>
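For anyone curious how such a mapping could be consumed, here is a small sketch that reads the field entries into a lookup table using the JDK's DOM parser. It is not how HbaseSerializer does it; ColumnRef and MappingReader are made-up names, and the sketch assumes the file is well-formed XML with the <table> element properly closed.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical holder for one HBase column coordinate.
class ColumnRef {
    final String family;
    final String qualifier;

    ColumnRef(String family, String qualifier) {
        this.family = family;
        this.qualifier = qualifier;
    }

    @Override
    public String toString() {
        return family + ":" + qualifier;
    }
}

// Hypothetical reader; only the <field> elements are interpreted here.
class MappingReader {
    static Map<String, ColumnRef> readFieldMapping(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            NodeList fields = doc.getElementsByTagName("field");
            // LinkedHashMap keeps the order in which fields appear in the file.
            Map<String, ColumnRef> mapping = new LinkedHashMap<>();
            for (int i = 0; i < fields.getLength(); i++) {
                Element field = (Element) fields.item(i);
                mapping.put(field.getAttribute("name"),
                        new ColumnRef(field.getAttribute("family"),
                                field.getAttribute("qualifier")));
            }
            return mapping;
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse mapping file", e);
        }
    }

    public static void main(String[] args) {
        String xml =
              "<table name=\"webtable\" class=\"org.apache.nutch.storage.WebTableRow\">"
            + "<description><family name=\"p\"/><family name=\"f\"/></description>"
            + "<fields>"
            + "<field name=\"fetchTime\" family=\"f\" qualifier=\"ts\"/>"
            + "<field name=\"title\" family=\"p\" qualifier=\"t\"/>"
            + "</fields>"
            + "</table>";
        // Each field name maps to its family:qualifier pair.
        System.out.println(readFieldMapping(xml));
    }
}
```

Combined with change tracking on the row, a writer could look up the column for each modified field and update only those columns.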
Sorry for the long and rambling email. Feel free to ask if anything is
unclear (and I assume it must be, given my incoherent description :)
--
Doğacan Güney
Re: State of nutchbase
Posted by Andrzej Bialecki <ab...@getopt.org>.
Alban Mouton wrote:
> Hello,
>
> I have looked a little into nutch code and mailing lists. I think the
> nutchbase branch (http://issues.apache.org/jira/browse/NUTCH-650) is
> very interesting, with a good potential to improve code clarity and
> flexibility (I find data structure quite obscure in current version).
> The issue is untouched since last august, so my question is : can
> nutchbase really be part of nutch 1.1 ?
Definitely no. Release 1.1 will be an update to 1.0, with no major
design changes. However, we intend to integrate the nutchbase branch
with trunk at some point - but since this would be a major change it
would come under 2.0 branch or so ...
> Is there still much work to do
> or is it almost ready ? Is it a worthy issue for an interested developer
> with a (still !) limited knowledge of the project ?
Please contact Dogacan, who is leading the work on this branch. AFAIK
he's going to update the design soon.
>
> So far I have only tried to run nutchbase in eclipse by applying the
> tutorial (http://wiki.apache.org/nutch/RunNutchInEclipse1.0) but I run
> in errors when building, mostly from Parser and tests. I may start by
> cleaning this up.
See above - please coordinate with Dogacan to avoid duplication of effort.
--
Best regards,
Andrzej Bialecki <><
___. ___ ___ ___ _ _ __________________________________
[__ || __|__/|__||\/| Information Retrieval, Semantic Web
___|||__|| \| || | Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com