Posted to user@nutch.apache.org by Adriana Farina <ad...@gmail.com> on 2013/01/31 12:03:29 UTC

Nutch 2.0 and HBase 0.90.4

Hello,

I've set up a cluster of 4 machines with Hadoop 1.0.4 and I'm trying to run
Nutch 2.0 in distributed mode, using HBase 0.90.4 to store the crawl data.
I've followed the Nutch2Tutorial (https://wiki.apache.org/nutch/Nutch2Tutorial)
and configured HBase following the quickstart guide at
http://hbase.apache.org/book/quickstart.html.
However, when I try to run Nutch, the crawl runs for a little while and then
I get the following exception:


org.apache.gora.util.GoraException:
org.apache.hadoop.hbase.MasterNotRunningException: master:60000
        at
org.apache.gora.store.DataStoreFactory.createDataStore(DataStoreFactory.java:167)
        at
org.apache.gora.store.DataStoreFactory.createDataStore(DataStoreFactory.java:118)
        at
org.apache.gora.mapreduce.GoraOutputFormat.getRecordWriter(GoraOutputFormat.java:88)
        at
org.apache.hadoop.mapred.MapTask$NewDirectOutputCollector.<init>(MapTask.java:628)
        at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:753)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370)
        at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:396)
        at
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1121)
        at org.apache.hadoop.mapred.Child.main(Child.java:249)
Caused by: org.apache.hadoop.hbase.MasterNotRunningException: master:60000
        at
org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.getMaster(HConnectionManager.java:396)
        at
org.apache.hadoop.hbase.client.HBaseAdmin.<init>(HBaseAdmin.java:94)
        at
org.apache.gora.hbase.store.HBaseStore.initialize(HBaseStore.java:108)
        at
org.apache.gora.store.DataStoreFactory.initializeDataStore(DataStoreFactory.java:102)
        at
org.apache.gora.store.DataStoreFactory.createDataStore(DataStoreFactory.java:161)
        ... 10 more


After that, the crawl keeps running, but after a few map/reduce cycles it
throws the same exception again, and so on.
The strange thing is that the HBase master is up and running: there are no
errors in the log files, and I can access http://localhost:60010/ without
any problem.

My hbase-site.xml is:


<configuration>

 <property>
    <name>hbase.master</name>
    <value>crawler1a:60000</value>
    <description>The host and port that the HBase master runs
at.</description>
  </property>

  <property>
    <name>hbase.rootdir</name>
    <value>hdfs://*master ip address*:54310/hbase</value>
    <description>The directory shared by region servers.</description>
  </property>

  <property>
    <name>hbase.cluster.distributed</name>
    <value>true</value>
    <description>The mode the cluster will be in. Possible values are
      false: standalone and pseudo-distributed setups with managed Zookeeper
      true: fully-distributed with unmanaged Zookeeper Quorum (see
hbase-env.sh)
    </description>
  </property>

 <!--<property>
    <name>hbase.zookeeper.quorum</name>
    <value>*master ip address*</value>
 </property>-->

  <property>
    <name>hbase.zookeeper.property.dataDir</name>
    <value>/usr/local/hbase-0.90.4/zookeeper_data</value>
  </property>

    <property>
      <name>hbase.zookeeper.quorum</name>
      <value>*cluster machines addresses*</value>
      <description>Comma separated list of servers in the ZooKeeper Quorum.
      For example, "host1.mydomain.com,host2.mydomain.com,host3.mydomain.com".
      By default this is set to localhost for local and pseudo-distributed
      modes of operation. For a fully-distributed setup, this should be set
      to a full list of ZooKeeper quorum servers. If HBASE_MANAGES_ZK is set
      in hbase-env.sh this is the list of servers which we will start/stop
      ZooKeeper on.
      </description>
    </property>

<property>
    <name>zookeeper.session.timeout</name>
    <value>30000</value>
    <description>ZooKeeper session timeout.
        HBase passes this to the zk quorum as suggested maximum time for a
        session (this setting becomes zookeeper's 'maxSessionTimeout'). See
        http://hadoop.apache.org/zookeeper/docs/current/zookeeperProgrammers.html#ch_zkSessions
        "The client sends a requested timeout, the server responds with the
        timeout that it can give the client." In milliseconds.
    </description>
  </property>

</configuration>



Searching on Google, I found that this can be caused by a misconfigured
/etc/hosts, but mine seems to be configured correctly:

127.0.0.1               crawler1a localhost.localdomain localhost

where crawler1a is the master machine for both Hadoop and HBase.


Can anybody help?

Thank you very much.

-- 
Adriana Farina

Re: Nutch 2.0 and HBase 0.90.4

Posted by Lewis John Mcgibbney <le...@gmail.com>.
Hi Adriana,
Thanks for the update. I've added the solution to our wiki for others to
consult in the future:

http://s.apache.org/jcs

Thank you for getting back to us on this one.
Lewis

On Mon, Feb 4, 2013 at 2:18 AM, Adriana Farina
<ad...@gmail.com> wrote:
> I solved my issue, and I want to describe how in case somebody else runs
> into the same problem.
>
> */etc/hosts* was not configured properly: I had to configure it as
> described in [0]. For each machine in my cluster, I had to comment out the
> line *127.0.0.1 localhost* and add *localhost* to the line where my
> master's address was written.
>
>
>
> [0]
> http://stackoverflow.com/questions/7791788/hbase-client-do-not-able-to-connect-with-remote-hbase-server

Re: Nutch 2.0 and HBase 0.90.4

Posted by Adriana Farina <ad...@gmail.com>.
I solved my issue, and I want to describe how in case somebody else runs
into the same problem.

*/etc/hosts* was not configured properly: I had to configure it as
described in [0]. For each machine in my cluster, I had to comment out the
line *127.0.0.1 localhost* and add *localhost* to the line where my
master's address was written.



[0]
http://stackoverflow.com/questions/7791788/hbase-client-do-not-able-to-connect-with-remote-hbase-server
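
For example, assuming the master's real address were 192.168.1.10 (that
address is just a placeholder here), the relevant /etc/hosts lines on each
node would end up looking roughly like this:

# 127.0.0.1        localhost
192.168.1.10       crawler1a localhost.localdomain localhost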



2013/1/31 Adriana Farina <ad...@gmail.com>

> Hello,
>
> I've set up a cluster of 4 machines with Hadoop 1.0.4 and I'm trying to
> run Nutch 2.0 in distributed mode, using HBase 0.90.4 to store the crawl
> data.
> I've followed the Nutch2Tutorial (https://wiki.apache.org/nutch/Nutch2Tutorial)
> and configured HBase following the quickstart guide at
> http://hbase.apache.org/book/quickstart.html.
> However, when I try to run Nutch, the crawl runs for a little while and
> then I get the following exception:
>
>
> org.apache.gora.util.GoraException:
> org.apache.hadoop.hbase.MasterNotRunningException: master:60000
>         at
> org.apache.gora.store.DataStoreFactory.createDataStore(DataStoreFactory.java:167)
>         at
> org.apache.gora.store.DataStoreFactory.createDataStore(DataStoreFactory.java:118)
>         at
> org.apache.gora.mapreduce.GoraOutputFormat.getRecordWriter(GoraOutputFormat.java:88)
>         at
> org.apache.hadoop.mapred.MapTask$NewDirectOutputCollector.<init>(MapTask.java:628)
>         at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:753)
>         at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370)
>         at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
>         at java.security.AccessController.doPrivileged(Native Method)
>         at javax.security.auth.Subject.doAs(Subject.java:396)
>         at
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1121)
>         at org.apache.hadoop.mapred.Child.main(Child.java:249)
> Caused by: org.apache.hadoop.hbase.MasterNotRunningException: master:60000
>         at
> org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.getMaster(HConnectionManager.java:396)
>         at
> org.apache.hadoop.hbase.client.HBaseAdmin.<init>(HBaseAdmin.java:94)
>         at
> org.apache.gora.hbase.store.HBaseStore.initialize(HBaseStore.java:108)
>         at
> org.apache.gora.store.DataStoreFactory.initializeDataStore(DataStoreFactory.java:102)
>         at
> org.apache.gora.store.DataStoreFactory.createDataStore(DataStoreFactory.java:161)
>         ... 10 more
>
>
> After that, the crawl keeps running, but after a few map/reduce cycles it
> throws the same exception again, and so on.
> The strange thing is that the HBase master is up and running: there are no
> errors in the log files, and I can access http://localhost:60010/ without
> any problem.
>
> My hbase-site.xml is:
>
>
>  <property>
>     <name>hbase.master</name>
>     <value>crawler1a:60000</value>
>     <description>The host and port that the HBase master runs
> at.</description>
>   </property>
>
>   <property>
>     <name>hbase.rootdir</name>
>     <value>hdfs://*master ip address*:54310/hbase</value>
>     <description>The directory shared by region servers.</description>
>   </property>
>
>   <property>
>     <name>hbase.cluster.distributed</name>
>     <value>true</value>
>     <description>The mode the cluster will be in. Possible values are
>       false: standalone and pseudo-distributed setups with managed
> Zookeeper
>       true: fully-distributed with unmanaged Zookeeper Quorum (see
> hbase-env.sh)
>     </description>
>   </property>
>
>  <!--<property>
>     <name>hbase.zookeeper.quorum</name>
>     <value>*master ip address*</value>
>  </property>-->
>
>   <property>
>     <name>hbase.zookeeper.property.dataDir</name>
>     <value>/usr/local/hbase-0.90.4/zookeeper_data</value>
>   </property>
>
>     <property>
>       <name>hbase.zookeeper.quorum</name>
>       <value>*cluster machines addresses*</value>
>       <description>Comma separated list of servers in the ZooKeeper Quorum.
>       For example, "host1.mydomain.com,host2.mydomain.com,
> host3.mydomain.com".
>        By default this is set to localhost for local and
> pseudo-distributed modes
>       of operation. For a fully-distributed setup, this should be set to a
> full
>       list of ZooKeeper quorum servers. If HBASE_MANAGES_ZK is set in
> hbase-env.sh
>       this is the list of servers which we will start/stop ZooKeeper on.
>       </description>
>     </property>
>
> <property>
>     <name>zookeeper.session.timeout</name>
>     <value>30000</value>
>     <description>ZooKeeper session timeout.
>         HBase passes this to the zk quorum as suggested maximum time for a
>         session (This setting becomes zookeeper's 'maxSessionTimeout').
>  See
>
> http://hadoop.apache.org/zookeeper/docs/current/zookeeperProgrammers.html#ch_zkSessions
>         "The client sends a requested timeout, the server responds with the
>         timeout that it can give the client. " In milliseconds.
>     </description>
>   </property>
>
> </configuration>
>
>
>
> Searching on Google, I found that this can be caused by a misconfigured
> /etc/hosts, but mine seems to be configured correctly:
>
> 127.0.0.1               crawler1a localhost.localdomain localhost
>
> where crawler1a is the master machine for both Hadoop and HBase.
>
>
> Can anybody help?
>
> Thank you very much.
>
> --
> Adriana Farina
>



-- 
Adriana Farina

Re: Nutch 2.0 and HBase 0.90.4

Posted by Adriana Farina <ad...@gmail.com>.
Hi Lewis,

thank you for your answer.

I read the thread you suggested, but I wasn't able to solve my problem.
Everything under hbase/conf and /etc/hosts seems to be correctly
configured.


I'll try to ask on the HBase list too.


Thank you again!


2013/1/31 Lewis John Mcgibbney <le...@gmail.com>

> Hi Adriana,
>
> On Thu, Jan 31, 2013 at 3:03 AM, Adriana Farina
> <ad...@gmail.com> wrote:
>
> > Searching on Google, I found that this can be caused by a misconfigured
> > /etc/hosts, but mine seems to be configured correctly:
> >
> > 127.0.0.1               crawler1a localhost.localdomain localhost
> >
> > where crawler1a is the master machine for both Hadoop and HBase.
>
> Not being an HBase guru, my help here is limited at best.
> 1) Head over to the HBase lists; this most likely has nothing to do with
> Nutch.
> 2) I take it that the thread you refer to is this one [0]? It suggests
> that your regionservers file under hbase/conf and /etc/hosts should
> match.
>
> lewis
>
> [0]
> http://stackoverflow.com/questions/12927997/nutch-2-1-cannot-setup-in-mac
>
> --
> Lewis
>



-- 
Adriana Farina

Re: Nutch 2.0 and HBase 0.90.4

Posted by Lewis John Mcgibbney <le...@gmail.com>.
Hi Adriana,

On Thu, Jan 31, 2013 at 3:03 AM, Adriana Farina
<ad...@gmail.com> wrote:

> Searching on Google, I found that this can be caused by a misconfigured
> /etc/hosts, but mine seems to be configured correctly:
>
> 127.0.0.1               crawler1a localhost.localdomain localhost
>
> where crawler1a is the master machine for both Hadoop and HBase.

Not being an HBase guru, my help here is limited at best.
1) Head over to the HBase lists; this most likely has nothing to do with
Nutch.
2) I take it that the thread you refer to is this one [0]? It suggests
that your regionservers file under hbase/conf and /etc/hosts should match
(rough sketch below).
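
For illustration only (the hostnames and addresses below are made up, not
taken from your setup), conf/regionservers lists one regionserver hostname
per line:

crawler2a
crawler3a
crawler4a

and /etc/hosts on every node should map those same names, plus the master,
to their real addresses, e.g.:

192.168.1.10    crawler1a
192.168.1.11    crawler2a
192.168.1.12    crawler3a
192.168.1.13    crawler4a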

lewis

[0] http://stackoverflow.com/questions/12927997/nutch-2-1-cannot-setup-in-mac

-- 
Lewis