Posted to dev@nutch.apache.org by "Doğacan Güney (JIRA)" <ji...@apache.org> on 2009/08/17 00:28:15 UTC

[jira] Commented: (NUTCH-650) Hbase Integration

    [ https://issues.apache.org/jira/browse/NUTCH-650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12743919#action_12743919 ] 

Doğacan Güney commented on NUTCH-650:
-------------------------------------

I just committed code to the nutchbase branch. The scoring API did not turn out as clean as I expected, but I decided to put in what I have. I also made some changes so that the web UI works.

I am leaving this issue open because I will add documentation tomorrow. Meanwhile,

To download: 

  svn co http://svn.apache.org/repos/asf/lucene/nutch/branches/nutchbase

Usage:

After starting HBase 0.20 (check out rev. 804408 from the HBase 0.20 branch), create a webtable with

  bin/nutch createtable webtable

After that, usage is similar to the existing Nutch commands.

  bin/nutch inject webtable url_dir # inject urls

Then, for as many cycles as you want:
    bin/nutch generate webtable # -topN N works
    bin/nutch fetch webtable # -threads N works
    bin/nutch parse webtable
    bin/nutch updatetable webtable

  bin/nutch index <index> webtable
or
  bin/nutch solrindex <solr url> webtable
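
The cycle above could be wrapped in a small shell function along these lines. This is only a sketch: the function name crawl, the NUTCH variable and the argument order are made up for illustration, and passing -topN after the table name is an assumption based on the comment above. It expects the webtable to have been created already.

```shell
#!/bin/sh
# Hypothetical helper, not part of Nutch: runs inject once, then N crawl cycles.
# NUTCH can be overridden (e.g. NUTCH=echo) to print the commands instead.
NUTCH="${NUTCH:-bin/nutch}"

# crawl TABLE URL_DIR CYCLES TOPN
crawl() {
  table="$1"; url_dir="$2"; cycles="$3"; topn="$4"

  # Seed the table with the start URLs.
  $NUTCH inject "$table" "$url_dir" || return 1

  i=1
  while [ "$i" -le "$cycles" ]; do
    # One cycle: generate, fetch, parse, update; stop at the first failure.
    $NUTCH generate "$table" -topN "$topn" || return 1
    $NUTCH fetch "$table" || return 1
    $NUTCH parse "$table" || return 1
    $NUTCH updatetable "$table" || return 1
    i=$((i + 1))
  done
}

# Example: crawl webtable url_dir 3 1000
```

Running it with NUTCH=echo first is a cheap way to check the command sequence before touching HBase.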

To use Solr, use this schema file:
http://www.ceng.metu.edu.tr/~e1345172/schema.xml


Again, a note of warning: This is extremely new code. I hope people will test and use it but there is no guarantee that it will work :)


> Hbase Integration
> -----------------
>
>                 Key: NUTCH-650
>                 URL: https://issues.apache.org/jira/browse/NUTCH-650
>             Project: Nutch
>          Issue Type: New Feature
>    Affects Versions: 1.0.0
>            Reporter: Doğacan Güney
>            Assignee: Doğacan Güney
>             Fix For: 1.1
>
>         Attachments: hbase-integration_v1.patch, hbase_v2.patch, malformedurl.patch, meta.patch, meta2.patch, nofollow-hbase.patch, nutch-habase.patch, searching.diff, slash.patch
>
>
> This issue will track nutch/hbase integration

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


Re: [jira] Commented: (NUTCH-650) Hbase Integration

Posted by xiao yang <ya...@gmail.com>.
Update: the Hadoop and HBase jar versions were not right. After updating the
jars in the 'lib/' directory and rebuilding, it now throws:

org.apache.hadoop.hbase.regionserver.NoSuchColumnFamilyException: org.apache.hadoop.hbase.regionserver.NoSuchColumnFamilyException: Column family mtdt: does not exist in region crawl,,1264048608430 in table {NAME => 'crawl', FAMILIES => [
{NAME => 'bas', COMPRESSION => 'NONE', VERSIONS => '3', TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'},
{NAME => 'cnt', COMPRESSION => 'NONE', VERSIONS => '3', TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'},
{NAME => 'cnttyp', COMPRESSION => 'NONE', VERSIONS => '3', TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'},
{NAME => 'fchi', COMPRESSION => 'NONE', VERSIONS => '3', TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'},
{NAME => 'fcht', COMPRESSION => 'NONE', VERSIONS => '3', TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'},
{NAME => 'hdrs', COMPRESSION => 'NONE', VERSIONS => '3', TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'},
{NAME => 'ilnk', COMPRESSION => 'NONE', VERSIONS => '3', TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'},
{NAME => 'modt', COMPRESSION => 'NONE', VERSIONS => '3', TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'},
{NAME => 'mtdt', COMPRESSION => 'NONE', VERSIONS => '3', TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'},
{NAME => 'olnk', COMPRESSION => 'NONE', VERSIONS => '3', TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'},
{NAME => 'prsstt', COMPRESSION => 'NONE', VERSIONS => '3', TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'},
{NAME => 'prtstt', COMPRESSION => 'NONE', VERSIONS => '3', TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'},
{NAME => 'prvfch', COMPRESSION => 'NONE', VERSIONS => '3', TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'},
{NAME => 'prvsig', COMPRESSION => 'NONE', VERSIONS => '3', TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'},
{NAME => 'repr', COMPRESSION => 'NONE', VERSIONS => '3', TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'},
{NAME => 'rtrs', COMPRESSION => 'NONE', VERSIONS => '3', TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'},
{NAME => 'scr', COMPRESSION => 'NONE', VERSIONS => '3', TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'},
{NAME => 'sig', COMPRESSION => 'NONE', VERSIONS => '3', TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'},
{NAME => 'stt', COMPRESSION => 'NONE', VERSIONS => '3', TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'},
{NAME => 'ttl', COMPRESSION => 'NONE', VERSIONS => '3', TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'},
{NAME => 'txt', COMPRESSION => 'NONE', VERSIONS => '3', TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMORY => 'false', BLOCKCACHE => 'true'}]}
    at org.apache.hadoop.hbase.regionserver.HRegion.checkFamily(HRegion.java:2381)
    at org.apache.hadoop.hbase.regionserver.HRegion.put(HRegion.java:1241)
    at org.apache.hadoop.hbase.regionserver.HRegion.put(HRegion.java:1208)
    at org.apache.hadoop.hbase.regionserver.HRegionServer.put(HRegionServer.java:1834)
    at sun.reflect.GeneratedMethodAccessor12.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.hadoop.hbase.ipc.HBaseRPC$Server.call(HBaseRPC.java:648)
    at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:915)

    at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
    at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
    at org.apache.hadoop.hbase.RemoteExceptionHandler.decodeRemoteException(RemoteExceptionHandler.java:94)
    at org.apache.hadoop.hbase.client.HConnectionManager$TableServers.getRegionServerWithRetries(HConnectionManager.java:995)
    at org.apache.hadoop.hbase.client.HConnectionManager$TableServers$2.doCall(HConnectionManager.java:1193)
    at org.apache.hadoop.hbase.client.HConnectionManager$TableServers$Batch.process(HConnectionManager.java:1115)
    at org.apache.hadoop.hbase.client.HConnectionManager$TableServers.processBatchOfRows(HConnectionManager.java:1201)
    at org.apache.hadoop.hbase.client.HTable.flushCommits(HTable.java:605)
    at org.apache.hadoop.hbase.client.HTable.put(HTable.java:470)
    at org.apache.nutch.crawl.Injector$UrlMapper.map(Injector.java:92)
    at org.apache.nutch.crawl.Injector$UrlMapper.map(Injector.java:62)
    at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
    at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:583)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:305)
    at org.apache.hadoop.mapred.Child.main(Child.java:170)

It's weird, because the error message itself shows that the column family 'mtdt' does exist.
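
One detail that may matter: the message names the missing family as "mtdt:" with a trailing colon, while the table descriptor lists "mtdt". A possible reading, which nothing in this thread confirms, is that the name was sent with the colon attached, so an exact-name lookup fails even though the family is there. The sketch below only illustrates that kind of mismatch against a descriptor string in the format printed above. has_family is a made-up helper, not an HBase or Nutch command.

```shell
#!/bin/sh
# Hypothetical helper: succeeds only if FAMILY exactly matches one of the
# family names in a descriptor string like the one in the exception above.
# has_family DESCRIPTOR FAMILY
has_family() {
  printf '%s\n' "$1" \
    | grep -o "NAME => '[^']*'" \
    | sed -e 1d -e "s/^NAME => '//" -e "s/'\$//" \
    | grep -qxF -- "$2"
  # grep -o prints one NAME entry per line; sed drops the first one (the
  # table name) and strips the wrapper; the last grep is a whole-line match.
}

# Example (shortened descriptor):
#   desc="{NAME => 'crawl', FAMILIES => [{NAME => 'mtdt', VERSIONS => '3'}]}"
#   has_family "$desc" "mtdt"    # succeeds
#   has_family "$desc" "mtdt:"   # fails: the trailing colon breaks the match
```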



Re: [jira] Commented: (NUTCH-650) Hbase Integration

Posted by xiao yang <ya...@gmail.com>.
Hi, Doğacan

I have checked out Nutchbase from
http://svn.apache.org/repos/asf/lucene/nutch/branches/nutchbase/
My HBase version is 0.20.2.

createtable succeeded, but inject doesn't work.

$ bin/nutch createtable crawl

Here is the status of HBase:
hbase(main):014:0> list
10/01/12 15:37:43 DEBUG client.HConnectionManager$TableServers: Cache hit for row <> in tableName .META.: location server 10.214.10.146:34592, location region name .META.,,1
crawl

1 row(s) in 0.0110 seconds

$ bin/nutch inject crawl urls
Injector: starting
Injector: urlDir: urls
Injecting new users failed!

Here is the log:

2010-01-12 15:38:57,515 WARN  mapred.LocalJobRunner - job_local_0001
java.lang.reflect.UndeclaredThrowableException
    at $Proxy0.getRegionInfo(Unknown Source)
    at org.apache.hadoop.hbase.client.HConnectionManager$TableServers.locateRootRegion(HConnectionManager.java:874)
    at org.apache.hadoop.hbase.client.HConnectionManager$TableServers.locateRegion(HConnectionManager.java:515)
    at org.apache.hadoop.hbase.client.HConnectionManager$TableServers.locateRegion(HConnectionManager.java:491)
    at org.apache.hadoop.hbase.client.HConnectionManager$TableServers.locateRegionInMeta(HConnectionManager.java:565)
    at org.apache.hadoop.hbase.client.HConnectionManager$TableServers.locateRegion(HConnectionManager.java:524)
    at org.apache.hadoop.hbase.client.HConnectionManager$TableServers.locateRegion(HConnectionManager.java:491)
    at org.apache.hadoop.hbase.client.HConnectionManager$TableServers.locateRegionInMeta(HConnectionManager.java:565)
    at org.apache.hadoop.hbase.client.HConnectionManager$TableServers.locateRegion(HConnectionManager.java:528)
    at org.apache.hadoop.hbase.client.HConnectionManager$TableServers.locateRegion(HConnectionManager.java:491)
    at org.apache.hadoop.hbase.client.HTable.<init>(HTable.java:123)
    at org.apache.hadoop.hbase.client.HTable.<init>(HTable.java:101)
    at org.apache.nutch.crawl.Injector$UrlMapper.setup(Injector.java:102)
    at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:142)
    at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:518)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:303)
    at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:176)
Caused by: org.apache.hadoop.ipc.RemoteException: java.io.IOException: java.lang.NullPointerException
    at java.lang.Class.searchMethods(Class.java:2646)
    at java.lang.Class.getMethod0(Class.java:2670)
    at java.lang.Class.getMethod(Class.java:1603)
    at org.apache.hadoop.hbase.ipc.HBaseRPC$Server.call(HBaseRPC.java:643)
    at org.apache.hadoop.hbase.ipc.HBaseServer$Handler.run(HBaseServer.java:915)

    at org.apache.hadoop.hbase.ipc.HBaseClient.call(HBaseClient.java:720)
    at org.apache.hadoop.hbase.ipc.HBaseRPC$Invoker.invoke(HBaseRPC.java:329)
    ... 17 more
2010-01-12 15:38:57,806 WARN  crawl.Injector - Injecting new users failed!

What's the problem?
Thanks!
Xiao
